Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

64+ messages / 9 participants

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-03-07 22:58 Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Michail Nikolaev @ 2025-03-07 22:58 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased + new parallel GIN builds supported.

Best regards,
Mikhail.



Attachments:

  [application/octet-stream] v16-0009-Concurrently-built-index-validation-uses-fresh-s.patch (14.1K, 3-v16-0009-Concurrently-built-index-validation-uses-fresh-s.patch)
From e579bfb2df7bfa0c87d721e05c68c8013915d441 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 17:21:29 +0100
Subject: [PATCH v16 09/12] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach of using a single reference snapshot held back xmin propagation. During a long validation the snapshot's xmin stayed pinned at its original value, so the backend could not advance its transaction horizon and concurrent VACUUM could not remove dead tuples on other tables until validation finished.

To address this, the validation process now periodically replaces the snapshot with a newer one. This lets the backend's xmin advance during validation, while all tuples valid according to the latest snapshot are still included in the index.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/heapam_handler.c | 77 +++++++++++++++---------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 14 +++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 +++++
 8 files changed, 97 insertions(+), 47 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e33345f6a34..54566223cb0 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -868,9 +868,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 6a05620bd67..64c633e0398 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -495,10 +495,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2e3e8a678c9..a596fc9920a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1791,8 +1791,8 @@ heapam_index_build_range_scan(Relation heapRelation,
  */
 static int
 heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
-										   Tuplesortstate  *aux,
-										   Tuplestorestate *store)
+									  Tuplesortstate  *aux,
+									  Tuplestorestate *store)
 {
 	int				num = 0;
 	/* state variables for the merge */
@@ -2048,7 +2048,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2059,9 +2060,35 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
-	 * Now take the snapshot that will be used by to filter candidate
-	 * tuples.
+	 * sanity checks
+	 */
+	Assert(OidIsValid(indexRelation->rd_rel->relam));
+
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+															  auxState->tuplesort,
+															  tuples_for_check);
+
+	/* It is our responsibility to close the tuple sorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
 	 *
 	 * Beware!  There might still be snapshots in use that treat some transaction
 	 * as in-progress that our temporary snapshot treats as committed.
@@ -2077,33 +2104,10 @@ heapam_index_validate_scan(Relation heapRelation,
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
 	limitXmin = snapshot->xmin;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
-	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
-
-	/*
-	 * sanity checks
-	 */
-	Assert(OidIsValid(indexRelation->rd_rel->relam));
-
-	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
-														 auxState->tuplesort,
-														 tuples_for_check);
-
-	/* It is our responsibility to sloe tuple sort as fast as we can */
-	tuplesort_end(state->tuplesort);
-	tuplesort_end(auxState->tuplesort);
-
-	state->tuplesort = auxState->tuplesort = NULL;
-
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2140,6 +2144,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2194,6 +2199,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+
+		if (page_read_counter % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but keep the newer value just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f7914ebb3d0..bb4e0fbb675 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index eeddacd0d52..4130e49dd98 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -190,14 +190,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -811,7 +813,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -925,6 +926,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -965,6 +970,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8df0b472e88..39d2f474865 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3485,8 +3485,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3499,7 +3500,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3582,6 +3583,7 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	 */
 	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3617,6 +3619,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3646,9 +3651,6 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 	}
 	tuplesort_performsort(state.tuplesort);
 	tuplesort_performsort(auxState.tuplesort);
-
-	PopActiveSnapshot();
-	InvalidateCatalogSnapshot();
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 51db7f23378..85f83a97a1f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4349,7 +4349,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0

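Editorial note on the refresh cadence in heapam_index_validate_scan above. The following is a minimal standalone sketch, not code from the patch: it only models the page_read_counter arithmetic. RESET_EACH_N_PAGE is a hypothetical stand-in for SO_RESET_SNAPSHOT_EACH_N_PAGE, whose real value is defined elsewhere in the series, and count_snapshot_resets is an illustrative name.

```c
#include <assert.h>

/* Hypothetical value; the real SO_RESET_SNAPSHOT_EACH_N_PAGE is not shown here. */
#define RESET_EACH_N_PAGE 4

/*
 * Model of the counter logic only: returns how many snapshot refreshes
 * would happen while visiting n_pages heap pages, and stores the 1-based
 * page after which the first refresh happens (-1 if there is none).
 */
static int
count_snapshot_resets(int n_pages, int *first_reset_page)
{
	int		page_read_counter = 1;	/* starts at 1, as in the patch */
	int		resets = 0;

	assert(n_pages >= 0);
	*first_reset_page = -1;

	for (int page = 1; page <= n_pages; page++)
	{
		page_read_counter++;		/* bumped once per page visited */

		/* ... the tuples on this page would be checked here ... */

		if (page_read_counter % RESET_EACH_N_PAGE == 0)
		{
			/* here the patch drops the old snapshot and takes a new one */
			if (resets == 0)
				*first_reset_page = page;
			resets++;
		}
	}
	return resets;
}
```

Because the counter starts at 1 and is incremented before the check, this model refreshes for the first time after N-1 pages and every N pages afterwards. With the assumed N of 4 and ten pages, that is after pages 3 and 7.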

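Editorial note on the "merge join" over sorted TID lists described in the index.c comment above. This is a sketch under stated assumptions rather than the body of heapam_index_validate_tuplesort_difference: it works on plain arrays instead of tuplesorts, the names encode_tid and sorted_difference are hypothetical, and skipping duplicate auxiliary entries is a choice of this sketch, not something the patch text confirms. The encoding follows the removed comment about sorting TIDs as int8 values (block number in the high bits, offset in the low 16 bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a (block, offset) TID into one sortable 64-bit integer. */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Write to out[] every value that appears in aux[] but not in target[].
 * Both inputs must be sorted ascending.  out[] needs room for aux_len
 * entries.  Returns the number of values written.
 */
static size_t
sorted_difference(const int64_t *target, size_t target_len,
				  const int64_t *aux, size_t aux_len,
				  int64_t *out)
{
	size_t		i = 0;
	size_t		n = 0;

	assert(out != NULL);

	for (size_t j = 0; j < aux_len; j++)
	{
		/* skip target entries that sort before the current aux entry */
		while (i < target_len && target[i] < aux[j])
			i++;

		if (i < target_len && target[i] == aux[j])
			continue;			/* already present in the target index */

		/* emit each missing value once, even if aux repeats it */
		if (n == 0 || out[n - 1] != aux[j])
			out[n++] = aux[j];
	}
	return n;
}
```

Each list is walked once, so the cost is linear in the combined length after sorting. Only the values left in out[] would then need a heap visit and a visibility check.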

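Editorial note on TransactionIdNewer from the transam.h hunk above. The sketch below is a standalone re-implementation for illustration, assuming the usual modulo-2^32 comparison rule for normal transaction IDs. The names xid_t, xid_follows and xid_newer are hypothetical stand-ins for TransactionId, TransactionIdFollows and TransactionIdNewer. It shows why a numerically smaller XID can still be the newer one after wraparound.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;					/* stand-in for TransactionId */

#define INVALID_XID			((xid_t) 0)	/* mirrors InvalidTransactionId */
#define FIRST_NORMAL_XID	((xid_t) 3)	/* mirrors FirstNormalTransactionId */

static bool
xid_is_normal(xid_t x)
{
	return x >= FIRST_NORMAL_XID;
}

/* True if a is logically later than b. */
static bool
xid_follows(xid_t a, xid_t b)
{
	/* special (non-normal) XIDs compare as plain unsigned integers */
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a > b;

	/* normal XIDs compare modulo 2^32 via the sign of the difference */
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs, ignoring an invalid argument. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
	if (a == INVALID_XID)
		return b;
	if (b == INVALID_XID)
		return a;
	return xid_follows(a, b) ? a : b;
}
```

In this model xid_newer(4294967290, 10) returns 10, because 10 lies only 16 IDs ahead of 4294967290 on the circular XID space.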
  [application/octet-stream] v16-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch (109.9K, 4-v16-0008-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-.patch)
From 31b2324ce9aeff3db57e25b53f38d1b6207d2af5 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v16 08/12] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index building process to use an auxiliary unlogged index
during construction. This improves efficiency of concurrent
index operations by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 doc/src/sgml/monitoring.sgml                  |  26 +-
 doc/src/sgml/ref/create_index.sgml            |  33 +-
 doc/src/sgml/ref/reindex.sgml                 |  43 +-
 src/backend/access/heap/README.HOT            |  15 +-
 src/backend/access/heap/heapam_handler.c      | 591 ++++++++++++------
 src/backend/catalog/index.c                   | 312 +++++++--
 src/backend/catalog/system_views.sql          |  17 +-
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 376 ++++++++---
 src/backend/nodes/makefuncs.c                 |   4 +-
 src/include/access/tableam.h                  |  31 +-
 src/include/catalog/index.h                   |  12 +-
 src/include/commands/progress.h               |  13 +-
 src/include/nodes/execnodes.h                 |   4 +-
 src/include/nodes/makefuncs.h                 |   3 +-
 .../expected/cic_reset_snapshots.out          |  28 +
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |  42 ++
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/expected/rules.out           |  17 +-
 src/test/regress/sql/create_index.sql         |  21 +
 21 files changed, 1193 insertions(+), 402 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 16646f560e8..be2d3d5a6db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6253,6 +6253,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that all
+       later transactions insert new tuples into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6293,13 +6305,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6316,8 +6327,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 208389e8006..e33345f6a34 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -614,25 +614,24 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
-    significantly longer to complete.  However, since it allows normal
+    <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
+    This method requires more total work than a standard index build and takes
+    longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
     and I/O load imposed by the index creation might slow other operations.
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
+    In a concurrent index build, the main and auxiliary indexes are each entered as an
     <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -645,10 +644,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -658,11 +658,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..6a05620bd67 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,11 +368,10 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
-    rebuild and takes significantly longer to complete as it needs to wait
+    rebuild and takes longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
     it allows normal operations to continue while the index is being rebuilt, this
     method is useful for rebuilding indexes in a production environment. Of
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
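
The validation step described in the README.HOT hunk above boils down to a set difference between two sorted TID lists: everything the auxiliary index collected that the target index does not yet contain. A minimal standalone sketch of that merge, assuming TIDs are already encoded as sorted 64-bit integers; `sorted_difference` and its parameters are hypothetical names for illustration and are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): copy every element of the
 * sorted array aux[] that does not occur in the sorted array mainv[]
 * into out[], and return how many elements were written.  out[] must
 * have room for naux elements.
 */
static size_t
sorted_difference(const int64_t *mainv, size_t nmain,
				  const int64_t *aux, size_t naux,
				  int64_t *out)
{
	size_t		i = 0;			/* cursor into mainv[] */
	size_t		n = 0;			/* number of elements written to out[] */

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes aux[j]. */
		while (i < nmain && mainv[i] < aux[j])
			i++;

		/* Main is exhausted or has overshot: aux[j] is missing from it. */
		if (i == nmain || mainv[i] > aux[j])
			out[n++] = aux[j];
	}

	return n;
}
```

Because both inputs are sorted, each cursor only moves forward, so the cost is linear in the combined length of the two lists. The patch does the same thing over tuplesorts and a tuplestore rather than plain arrays.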
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index fa582d3e2d6..2e3e8a678c9 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,450 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number =  ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Now take the snapshot that will be used to filter candidate
+	 * tuples.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is returned for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
+
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
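
The read stream callback in the heapam_handler.c hunk above groups a sorted stream of encoded TIDs into one block number plus a terminated list of offsets per call. A simplified standalone sketch of that grouping follows. It assumes the TIDs sit in a plain array encoded as `(block << 16) | offset`, so it can leave an unread TID in place instead of saving a "leftover" offset as the patch must do with a tuplestore. `GroupState`, `next_block`, `INVALID_BLOCK` and `INVALID_OFFSET` are hypothetical stand-ins, not names from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK	UINT32_MAX	/* stand-in for InvalidBlockNumber */
#define INVALID_OFFSET	0			/* stand-in for InvalidOffsetNumber */

/* Hypothetical iteration state over a sorted array of encoded TIDs. */
typedef struct GroupState
{
	const int64_t *tids;		/* sorted; assumed (block << 16) | offset */
	size_t		ntids;			/* number of elements in tids[] */
	size_t		pos;			/* next element to read */
} GroupState;

/*
 * Return the next block number and fill offsets[] with every offset that
 * belongs to it, terminated by INVALID_OFFSET.  Returns INVALID_BLOCK once
 * the input is exhausted.  offsets[] must have room for the largest
 * per-block group plus the terminator.
 */
static uint32_t
next_block(GroupState *st, uint16_t *offsets)
{
	uint32_t	block = INVALID_BLOCK;
	size_t		n = 0;

	assert(st != NULL && offsets != NULL);

	while (st->pos < st->ntids)
	{
		int64_t		tid = st->tids[st->pos];
		uint32_t	b = (uint32_t) (tid >> 16);

		if (block == INVALID_BLOCK)
			block = b;			/* first TID seen in this call */
		else if (b != block)
			break;				/* leave this TID for the next call */

		offsets[n++] = (uint16_t) (tid & 0xffff);
		st->pos++;
	}

	offsets[n] = INVALID_OFFSET;
	return block;
}
```

Calling `next_block` repeatedly yields each distinct block once, in ascending order, which is the access pattern the validation loop relies on when it locks a buffer and checks all candidate offsets on that page together.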
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 44e5bc30d3e..8df0b472e88 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +748,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +760,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +797,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1407,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1462,7 +1473,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1484,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux index are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2468,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2528,7 +2690,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3284,12 +3447,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it is based on STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan; as a result, we have an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3299,18 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3493,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3518,27 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,12 +3571,16 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
 
 	/* mark build is concurrent just for consistency */
@@ -3413,15 +3599,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3645,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,8 +3680,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3533,6 +3744,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4020,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4269,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4295,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a4d2cfdcaf5..ad15db57fd8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1274,16 +1274,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4b50d6ee8cf..51db7f23378 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,10 +588,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -833,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1257,7 +1272,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1599,6 +1615,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index's catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1627,11 +1653,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1641,7 +1667,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1680,7 +1706,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1692,15 +1718,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1728,43 +1778,31 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	/*
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1787,12 +1825,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1817,6 +1855,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop the auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3537,6 +3622,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3642,8 +3728,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3695,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3757,6 +3857,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3860,15 +3967,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3919,6 +4029,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3932,12 +4047,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3946,6 +4066,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3964,10 +4085,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4048,13 +4173,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it only creates an empty index, with no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4097,24 +4264,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4129,13 +4324,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4147,16 +4335,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4176,7 +4356,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4266,14 +4446,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4298,6 +4478,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4311,11 +4513,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4335,6 +4537,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 2d0c7a53563..a53779ae2aa 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -787,7 +787,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -803,6 +803,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -828,7 +829,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index b1920999f12..1b2ef8f8002 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -715,11 +715,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1863,22 +1863,25 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both state and auxstate.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..01f85e57ea2 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8c0ad96e02c..4826c1a5538 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index bd5f002cf20..34362e3d875 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3049,6 +3050,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3061,8 +3063,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3090,6 +3094,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 62f69ac20b2..09cfe799efa 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,14 +2020,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index be570da08a0..fcff5d19998 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1250,10 +1251,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1265,6 +1268,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v16-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch (28.7K, 5-v16-0011-Add-proper-handling-of-auxiliary-indexes-during-.patch)
  download | inline diff:
From 1244e33a5e24b4c34529e7a5e5028174480aae49 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v16 11/12] Add proper handling of auxiliary indexes during
 DROP/REINDEX operations

During concurrent index operations, an auxiliary index may be created to help
with the process. If the build fails (for example, because of an index
constraint violation), such indexes are left behind as junk indexes that serve
no purpose. This patch improves the handling of such auxiliary indexes:

* Add auxiliaryForIndexId parameter to index_create() to track dependencies
* Automatically drop auxiliary indexes when the main index is dropped
* Delete junk auxiliary indexes properly during REINDEX operations
* Add regression tests to verify new behaviour
---
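
A minimal sketch of the dependency-based lookup described above, written as a
standalone toy model rather than PostgreSQL code. The names ToyDepend,
toy_deps and toy_get_auxiliary_index are hypothetical; the real function scans
pg_depend through the catalog API, while this version walks a small in-memory
array with the same filter (AUTO dependency, dependent object is an index).

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for PostgreSQL types; none of this is the real catalog API. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef enum { DEP_NORMAL, DEP_AUTO } ToyDepType;
typedef enum { KIND_TABLE, KIND_INDEX } ToyRelKind;

/* One simplified dependency row: "objid depends on refobjid". */
typedef struct
{
    Oid         objid;      /* dependent object */
    ToyRelKind  objkind;    /* kind of the dependent object */
    Oid         refobjid;   /* referenced object */
    ToyDepType  deptype;
} ToyDepend;

/* Hypothetical sample data: table 100, indexes 300 and 400, aux index 301. */
static const ToyDepend toy_deps[] = {
    {300, KIND_INDEX, 100, DEP_AUTO},   /* main index depends on its table */
    {302, KIND_INDEX, 300, DEP_NORMAL}, /* not AUTO, so it is ignored */
    {301, KIND_INDEX, 300, DEP_AUTO},   /* auxiliary index of 300 */
    {400, KIND_INDEX, 100, DEP_AUTO},   /* index without an auxiliary index */
};
#define TOY_NDEPS (sizeof(toy_deps) / sizeof(toy_deps[0]))

/*
 * Return the first index holding an AUTO dependency on indexId,
 * or InvalidOid if there is none.
 */
static Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
    assert(deps != NULL || ndeps == 0);

    for (size_t i = 0; i < ndeps; i++)
    {
        if (deps[i].refobjid == indexId &&
            deps[i].deptype == DEP_AUTO &&
            deps[i].objkind == KIND_INDEX)
            return deps[i].objid;
    }
    return InvalidOid;
}
```

In this model, looking up index 300 yields 301 and looking up index 400 yields
InvalidOid. Like the patch, the sketch assumes the caller passes an index OID.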
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 54566223cb0..fb7cd15f5fe 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -661,10 +661,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 64c633e0398..c6db5d57167 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 39d2f474865..c9eaa169274 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -687,6 +687,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -733,6 +735,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -775,6 +778,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1176,6 +1181,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1458,6 +1472,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1608,6 +1623,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3832,6 +3848,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3888,6 +3905,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4176,7 +4206,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4265,13 +4296,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4297,18 +4345,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 05a63e21475..782aaffa7bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1254,7 +1254,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3588,6 +3588,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 	} ReindexIndexInfo;
@@ -3936,6 +3937,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3943,6 +3945,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4005,12 +4008,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4020,6 +4028,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4040,10 +4049,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4200,7 +4217,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start dropping auxiliary indexes, including junk
+	 * indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4219,6 +4237,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4401,6 +4422,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4446,6 +4469,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 59156a1c1f6..df152c8466d 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1491,6 +1491,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1551,9 +1553,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1605,6 +1618,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop is required to be the first transaction. But
+		 * if there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In this case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only one element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1633,12 +1674,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 01f85e57ea2..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 34362e3d875..8aa6815b37c 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3117,20 +3117,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fcff5d19998..5e5cf23d97d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1279,11 +1279,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v16-0012-Updates-index-insert-and-value-computation-logic.patch (2.2K, 6-v16-0012-Updates-index-insert-and-value-computation-logic.patch)
  download | inline diff:
From 827d63ce910b5a7328d4c79dbc8480de60d4fef6 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v16 12/12] Updates index insert and value computation logic to
 optimize auxiliary index handling.

* Skip index value computation for auxiliary indices since they are not needed
* Set indexUnchanged=false for auxiliary indices to avoid unnecessary checks
---
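
A minimal sketch of the two shortcuts listed above, as a standalone toy model
and not the patched PostgreSQL functions. ToyIndexInfo, toy_form_index_datum,
toy_index_unchanged_hint and the counter toy_expensive_checks are hypothetical
names; the counter only exists to show that the costly check is never reached
for an auxiliary index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;    /* toy stand-in for PostgreSQL's Datum */

typedef struct
{
    int     natts;          /* number of index columns */
    bool    auxiliary;      /* true for an auxiliary index */
} ToyIndexInfo;

/* Counts how often the (pretend) expensive comparison was executed. */
static int toy_expensive_checks = 0;

/* Fill values/isnull for one row; an auxiliary index gets all NULLs. */
static void
toy_form_index_datum(const ToyIndexInfo *info, const Datum *row,
                     Datum *values, bool *isnull)
{
    assert(info != NULL);

    if (info->auxiliary)
    {
        /* No values are needed, so skip all per-column work. */
        for (int i = 0; i < info->natts; i++)
        {
            values[i] = (Datum) 0;
            isnull[i] = true;
        }
        return;
    }

    for (int i = 0; i < info->natts; i++)
    {
        values[i] = row[i];
        isnull[i] = false;
    }
}

/* Stand-in for a costly "did the indexed columns change?" check. */
static bool
toy_index_unchanged_by_update(void)
{
    toy_expensive_checks++;
    return true;
}

/* Short-circuit: the costly check is skipped for auxiliary indexes. */
static bool
toy_index_unchanged_hint(bool update, const ToyIndexInfo *info)
{
    return update && !info->auxiliary && toy_index_unchanged_by_update();
}
```

With an auxiliary ToyIndexInfo, every isnull entry comes back true and the
hint is false without touching the counter; with a regular one, the row values
are copied and the check runs once per call.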
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c9eaa169274..7ac6d3af606 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2931,6 +2931,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f2a74b76465..eef1b35e68c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -441,11 +441,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v16-0010-Remove-PROC_IN_SAFE_IC-optimization.patch (21.4K, 7-v16-0010-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 704bc5d6ccdba0e5346329642c9755778fd11bec Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v16 10/12] Remove PROC_IN_SAFE_IC optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
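
Reviewer note: a standalone sketch of the exclusion-mask filtering that WaitForOlderSnapshots() relies on, to illustrate what dropping PROC_IN_SAFE_IC from the mask means. It is not PostgreSQL code. `ProcSketch` and `count_procs_to_wait_for` are hypothetical, and xids are compared as plain integers, whereas real xid comparison is wraparound-aware. The two flag values are the ones shown in proc.h in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in src/include/storage/proc.h. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02

/* Hypothetical, heavily reduced stand-in for a PGPROC entry. */
typedef struct ProcSketch
{
	uint8_t		statusFlags;
	uint32_t	xmin;			/* 0 means "no snapshot advertised" */
} ProcSketch;

/*
 * Count the processes a waiter has to wait for: those advertising an xmin
 * at or before limitXmin whose statusFlags share no bit with excludeFlags.
 */
static size_t
count_procs_to_wait_for(const ProcSketch *procs, size_t nprocs,
						uint32_t limitXmin, uint8_t excludeFlags)
{
	size_t		count = 0;

	assert(procs != NULL || nprocs == 0);

	for (size_t i = 0; i < nprocs; i++)
	{
		if (procs[i].statusFlags & excludeFlags)
			continue;			/* excluded by mask, e.g. lazy VACUUM */
		if (procs[i].xmin == 0)
			continue;			/* holds no snapshot */
		if (procs[i].xmin <= limitXmin)
			count++;
	}
	return count;
}
```

With this patch, a backend running CREATE INDEX CONCURRENTLY carries no status flag, so it falls into the "waited for" group like any other backend holding an old snapshot.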

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccf74c0e1b6..1a0f7d13ece 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2891,11 +2891,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index fb4c4a31c74..49a57493340 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2094,11 +2094,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index bb4e0fbb675..4336cdb756c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 85f83a97a1f..05a63e21475 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1187,10 +1180,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1677,10 +1666,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1735,9 +1720,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1767,10 +1749,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1796,9 +1774,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1815,9 +1791,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1858,10 +1831,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1882,10 +1851,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3625,7 +3590,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3997,17 +3961,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4067,7 +4020,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4160,11 +4112,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4195,10 +4142,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4207,11 +4150,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4236,10 +4174,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4259,11 +4193,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4284,10 +4213,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4320,10 +4245,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4351,9 +4272,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4375,13 +4293,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4437,12 +4348,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4504,12 +4409,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4769,36 +4668,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 114eb1f8f76..7f6a9ccf126 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v16-0007-tuplestore-add-support-for-storing-Datum-values.patch (17.3K, 8-v16-0007-tuplestore-add-support-for-storing-Datum-values.patch)
  download | inline diff:
From e6cd266ab480c139cdf69103e7e8f3b18326e93a Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v16 07/12] tuplestore: add support for storing Datum values

Add ability to store and retrieve individual Datum values in tuplestore, optimizing storage based on type:

- Fixed-length: stores raw bytes without length prefix
- Variable-length: includes length prefix/suffix
- By-value types handled inline

This extends tuplestore beyond handling tuples; it is planned for use in the next patch.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)
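
Reviewer note: a standalone sketch of the on-tape footprint implied by the commit message above. It is not the patch's writetup_datum(). `datum_on_tape_size` is a hypothetical helper, and the layout is an assumption: a fixed-length Datum is written as raw bytes, while a variable-length Datum gets a leading "unsigned int" length word plus a trailing one only when backward scans are enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Bytes one stored Datum occupies on tape under the assumed layout.
 * typlen > 0 means a fixed-length type; otherwise datalen is the payload
 * size of the variable-length value.
 */
static size_t
datum_on_tape_size(int typlen, size_t datalen, bool backward)
{
	size_t		lenword = sizeof(unsigned int);

	if (typlen > 0)
		return (size_t) typlen; /* raw bytes, no length words */

	assert(datalen > 0);
	return lenword + datalen + (backward ? lenword : 0);
}
```

Under these assumptions, a store of fixed-length values such as int8 saves one or two length words per entry compared with the tuple layout.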

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index d61b601053c..03434f3ea49 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* whether that type is pass-by-value */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next stored tuple from tape.  Used to
+	 * provide the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it not ommitted like in case of contant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			/* length word must match what lentup_datum() reads back */
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index ed7c454f44e..1f431863387 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v16-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch (39.0K, 9-v16-0005-Allow-snapshot-resets-in-concurrent-unique-index.patch)
  download | inline diff:
From 205008ee146ac36801b9810331226a99448027de Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v16 05/12] Allow snapshot resets in concurrent unique index
 builds

 Previously, concurrent unique index builds used a fixed snapshot for the entire
 scan to ensure proper uniqueness checks. This could delay vacuum's ability to
 clean up dead tuples.

 Snapshots are now reset periodically during concurrent unique index builds,
 while uniqueness is still enforced by:

 1. Ignoring dead tuples during uniqueness checks in tuplesort
 2. Adding a uniqueness check in _bt_load that detects multiple live tuples
    with the same key values

 This improves vacuum effectiveness during long-running index builds without
 compromising index uniqueness enforcement.
---
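Reviewer note (not part of the commit): the second point of the commit message can be illustrated with a minimal, standalone sketch. It assumes a simplified model in which each sorted entry already carries a liveness flag. In the patch itself, liveness comes from table_index_fetch_tuple() with SnapshotSelf and is only checked for entries whose keys compare equal. SketchEntry and sketch_find_live_duplicate are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index entry; not a PostgreSQL type. */
typedef struct SketchEntry
{
	int			key;			/* index key value */
	bool		live;			/* stand-in for a heap visibility check */
} SketchEntry;

/*
 * Walk entries sorted by key.  Return the position of the first entry that
 * is the second live one within a run of equal keys, or -1 if every run
 * contains at most one live entry.
 */
static long
sketch_find_live_duplicate(const SketchEntry *entries, size_t n)
{
	bool		seen_live = false;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* A different key starts a new group: forget earlier live entries. */
		if (i == 0 || entries[i].key != entries[i - 1].key)
			seen_live = false;

		if (entries[i].live)
		{
			if (seen_live)
				return (long) i;	/* second live entry in this group */
			seen_live = true;
		}
	}
	return -1;
}
```

For the "addda" pattern (live, dead, dead, dead, live, all with the same key) the function reports the last entry, which is the case a comparison of adjacent tuples alone would miss.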
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 263 insertions(+), 93 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 22274f095ac..fa582d3e2d6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index cbe73675f86..5db6d237c2c 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 810f80fc8e6..f7914ebb3d0 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is unique and built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * In a concurrent build, dead tuples must be ignored by the uniqueness
+	 * check, because the snapshot is reset periodically.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples need not be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot
+	 * during the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool holds at most one
+	 * live tuple per key value.  Several live tuples may remain if they are
+	 * separated by dead ones, like this: addda (a = alive, d = dead).
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple, if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are there multiple live tuples in this group of equal keys? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 693e43c674b..f9695fba8b5 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -51,8 +51,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2828,7 +2826,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -2946,17 +2944,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the opclass or collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls (if provided) is set to true if any compared column is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -2982,6 +2987,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3001,7 +3008,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3012,7 +3019,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3021,6 +3029,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls (if provided) is set to true if any compared column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -3029,7 +3039,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -3046,6 +3057,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 482d9a1786d..e369ad0b723 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3301,9 +3301,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 36b875945d3..4b50d6ee8cf 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,8 +1700,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..18f90d46a73 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -359,6 +361,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -401,6 +404,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1655,6 +1659,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1664,18 +1669,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e4fdeca3402..d22a9797ad0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1314,8 +1314,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 313394d92c6..b1920999f12 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1800,9 +1800,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For concurrent index builds, SO_RESET_SNAPSHOT is set for the scan.  This
+ * replaces the snapshot on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
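
For readers skimming the tuplesortvariants.c hunk above: the decision made by the fail-fast check in comparetup_index_btree_tiebreak can be summarized by the standalone sketch below. It is not code from the patch. The names duplicate_is_violation, tuple_alive_fn and alive_from_array are hypothetical, and the callback merely stands in for the table_index_fetch_tuple(..., SnapshotSelf, ...) lookups. The sketch assumes that a pair of equal keys is reported only when both heap tuples are still alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap visibility lookup under SnapshotSelf. */
typedef bool (*tuple_alive_fn) (int tid, void *ctx);

/*
 * Decide whether two index tuples with equal keys are a unique violation.
 *
 * Without unique_dead_ignored every duplicate pair is a violation.  With it,
 * the pair is a violation only if both heap tuples are alive; the second
 * lookup is skipped when the first tuple is already known to be dead.
 */
static bool
duplicate_is_violation(bool unique_dead_ignored, int tid1, int tid2,
					   tuple_alive_fn alive, void *ctx)
{
	assert(alive != NULL);

	if (!unique_dead_ignored)
		return true;
	if (!alive(tid1, ctx))
		return false;
	return alive(tid2, ctx);
}

/* Toy visibility source for the example: ctx is a bool array indexed by tid. */
static bool
alive_from_array(int tid, void *ctx)
{
	const bool *map = ctx;

	return map[tid];
}
```

With a toy map in which tid 1 is dead, the pair (0, 1) is skipped while the pair (0, 2) is still reported, which mirrors the uniqueCheckFail flag in the hunk.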



  [application/octet-stream] v16-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch (37.0K, 10-v16-0006-Add-STIR-Short-Term-Index-Replacement-access-met.patch)
  download | inline diff:
From a78a10d9d8bce018fc178ef9e3d787fb627daed8 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v16 06/12] Add STIR (Short-Term Index Replacement) access
 method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
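
As a rough illustration of "temporarily storing TIDs", the following standalone sketch shows an append-only page of TIDs whose add operation reports when the page is full. It is only a sketch under assumptions: SketchTid, SketchPage, SKETCH_PAGE_CAPACITY and sketch_page_add are hypothetical names, and the fixed capacity stands in for the real free-space calculation on a buffer page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; a real page would hold far more entries. */
#define SKETCH_PAGE_CAPACITY 4

/* Simplified TID: heap block number plus offset within the block. */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Append-only page: maxoff counts the entries stored so far. */
typedef struct SketchPage
{
	uint16_t	maxoff;
	SketchTid	items[SKETCH_PAGE_CAPACITY];
} SketchPage;

/*
 * Append a TID to the end of the page.  Returns false if it does not fit,
 * in which case the caller is expected to move on to a fresh page.
 */
static bool
sketch_page_add(SketchPage *page, SketchTid tid)
{
	assert(page != NULL);

	if (page->maxoff >= SKETCH_PAGE_CAPACITY)
		return false;

	page->items[page->maxoff++] = tid;
	return true;
}
```

A caller would retry on a newly allocated page whenever sketch_page_add returns false. No ordering or lookup structure is maintained, which is consistent with an access method that only needs to collect TIDs for a later pass.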
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3b91d02605a..134636c4cc9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3074,6 +3074,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3125,6 +3126,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e369ad0b723..44e5bc30d3e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b5fbdcbd82..9ab60f37570 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index dbbc2f1e30d..2d0c7a53563 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -828,6 +828,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1be8739573f..44f8a0d5606 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 43445cdcc6c..26ddd5ec577 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	BlockNumber lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 134b3dd8689..ac1e7e7c7f2 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a323fa98bbb..8c0ad96e02c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it a helper index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 6543e90de75..fcd8a7c556f 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5136,7 +5136,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5150,7 +5151,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5175,9 +5177,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5186,12 +5188,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5200,7 +5203,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
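
As an illustration of the append-only page layout that StirPageAddItem relies on in the patch above, here is a minimal, self-contained sketch in plain C. It is not the patch's code: the ToyPage and ToyTuple types, the toy_* helpers, and the tiny page size are all hypothetical stand-ins chosen so the free-space check is easy to follow. The idea it shows is that fixed-size tuples are copied to the slot after maxoff, and the insert is refused once fewer than one tuple's worth of bytes remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes; a real page is BLCKSZ bytes with a full page header. */
#define TOY_DATA_BYTES 56

/* Hypothetical 6-byte tuple, roughly the shape of a heap TID. */
typedef struct ToyTuple
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} ToyTuple;

/* Hypothetical page: a tuple counter plus a flat data area. */
typedef struct ToyPage
{
	uint16_t	maxoff;			/* number of tuples stored so far */
	unsigned char data[TOY_DATA_BYTES];
} ToyPage;

static void
toy_page_init(ToyPage *page)
{
	page->maxoff = 0;
	memset(page->data, 0, sizeof(page->data));
}

/* Bytes still unused in the data area. */
static size_t
toy_page_free_space(const ToyPage *page)
{
	return sizeof(page->data) - (size_t) page->maxoff * sizeof(ToyTuple);
}

/* Append a tuple at the end of the page; returns false if it does not fit. */
static bool
toy_page_add_item(ToyPage *page, const ToyTuple *tuple)
{
	if (toy_page_free_space(page) < sizeof(ToyTuple))
		return false;

	memcpy(page->data + (size_t) page->maxoff * sizeof(ToyTuple),
		   tuple, sizeof(ToyTuple));
	page->maxoff++;
	return true;
}

/* Fetch a copy of the tuple at a 1-based offset (caller checks the range). */
static ToyTuple
toy_page_get_tuple(const ToyPage *page, uint16_t offset)
{
	ToyTuple	result;

	memcpy(&result, page->data + (size_t) (offset - 1) * sizeof(ToyTuple),
		   sizeof(ToyTuple));
	return result;
}
```

With these toy sizes a page holds nine tuples and leaves two bytes unused, so the tenth toy_page_add_item call reports that the page is full. That is the point at which the real insert path would move on to a new page.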



  [application/octet-stream] v16-0004-Allow-snapshot-resets-during-parallel-concurrent.patch (41.5K, 11-v16-0004-Allow-snapshot-resets-during-parallel-concurrent.patch)
  download | inline diff:
From 18b28c955ff0e862caae8c54d2cfbc0935fdf50d Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v16 04/12] Allow snapshot resets during parallel concurrent
 index builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)
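
The xmin-horizon effect described in the commit message above can be illustrated with a small toy model in plain C. This is only a sketch under simplifying assumptions, not PostgreSQL code: ToyXid and toy_max_xmin_lag are hypothetical names, the transaction counter is assumed to advance by exactly one per scanned page, and a "snapshot" is reduced to the xid value at which it was taken. The function reports how far the held snapshot falls behind the newest xid during a scan, with and without periodic resets.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* hypothetical stand-in for a transaction id */

/*
 * Simulate scanning npages pages while the toy xid counter advances by one
 * per page.  With reset_every == 0 a single snapshot is held for the whole
 * scan; otherwise a fresh snapshot is taken every reset_every pages.
 * Returns the largest distance observed between the newest xid and the xmin
 * of the snapshot currently held.
 */
static ToyXid
toy_max_xmin_lag(uint32_t npages, uint32_t reset_every)
{
	ToyXid		next_xid = 100;
	ToyXid		snapshot_xmin = next_xid;
	ToyXid		max_lag = 0;

	for (uint32_t page = 0; page < npages; page++)
	{
		/* Periodic reset: drop the old snapshot and take a fresh one. */
		if (reset_every != 0 && page != 0 && page % reset_every == 0)
			snapshot_xmin = next_xid;

		/* Concurrent activity while this page is being scanned. */
		next_xid++;

		if (next_xid - snapshot_xmin > max_lag)
			max_lag = next_xid - snapshot_xmin;
	}

	return max_lag;
}
```

In this model a 1000-page scan with one snapshot ends up 1000 xids behind, while resetting every 10 pages keeps the lag at 10 or less. The real patch is far more involved (workers must first restore the leader's initial snapshot, and unique indexes keep a single snapshot), but the bounded lag is the property that lets VACUUM clean up dead tuples during a long build.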

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 08dc35dd8df..ccf74c0e1b6 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1218,7 +1217,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1251,7 +1249,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1266,6 +1263,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -2366,7 +2364,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2397,25 +2394,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2455,8 +2452,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2481,7 +2476,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2527,7 +2523,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2543,6 +2538,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need
+	 * to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2551,7 +2553,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2574,9 +2577,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2776,14 +2776,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2805,6 +2805,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2945,6 +2946,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f6f40c2f53f..fb4c4a31c74 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
 static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as the
+	 * active snapshot.  The scan then indexes whatever is live according to
+	 * the active snapshot, which is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we must
+	 * wait until every worker has imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1778,14 +1778,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1808,6 +1808,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2167,6 +2168,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f17c5dbacaa..22274f095ac 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that, while that snapshot is
+	 * reset periodically (for a non-unique index only).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index b490da0eeee..810f80fc8e6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we must
+	 * wait until every worker has imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..277c79dd554 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly (SO_RESET_SNAPSHOT) and use the current active snapshot
+	 * as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate per-worker flags in dynamic shared memory so that each
+		 * worker can report that it has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker counts as attached only once it has
+ * also restored its snapshot. This is needed when using periodic snapshot resets
+ * to ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0a153c6f746..482d9a1786d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 6f9e991eeae..bc639964ada 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 3d018c3a1e8..4cd536e988c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index dc6e0184284..8529b808aed 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5393b30c57e..313394d92c6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1181,7 +1181,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1799,9 +1800,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. Snapshots are then replaced on the fly, which allows the xmin
+ * horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v16-0002-Add-stress-tests-for-concurrent-index-operations.patch (8.1K, 12-v16-0002-Add-stress-tests-for-concurrent-index-operations.patch)
From e5bbf8457ce5947616cfaab2dc59d1099ba58b3c Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v16 02/12] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 190 ++++++++++++++++++++++++++++++++
 2 files changed, 191 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..0d755373ee4
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,190 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN/GIST/BRIN/HASH/SPGIST indexes concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN/GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 4)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIN (ia);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING GIST (p);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING BRIN (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING HASH (updated_at);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v16-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch (46.8K, 13-v16-0003-Allow-advancing-xmin-during-non-unique-non-paral.patch)
From 585c27c1260b7d26c5357933face681a41371804 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v16 03/12] Allow advancing xmin during non-unique,
 non-parallel concurrent index builds by periodically resetting snapshots

Long-running transactions, such as those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY, can hold back the global xmin horizon. This prevents VACUUM from cleaning up dead tuples and can lead to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to mitigate this by allowing VACUUM to ignore transactions that build indexes with CONCURRENTLY. However, it was reverted in commit e28bb8851969 because indexes could miss heap tuples that were HOT-updated and HOT-pruned during index creation, which led to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

* Non-parallel index builds: parallel index builds are not yet supported and will be addressed in a future commit.
* Non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
* Only the first scan of the heap: the second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.
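
The every-N-pages rule described above can be illustrated with a small standalone sketch. This is not the patch's code: the Sketch* names, the flag bit value, and the reset_every field are hypothetical stand-ins (the real interval is SO_RESET_SNAPSHOT_EACH_N_PAGE, whose value is not shown here), and the scan is assumed to visit blocks sequentially from block 0. It only models the condition checked after each fetched block, namely that the flag is set and the block number is a multiple of the interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SO_RESET_SNAPSHOT scan flag (bit value assumed). */
#define SKETCH_SO_RESET_SNAPSHOT (1u << 0)

/* Hypothetical, simplified scan state; not the real HeapScanDesc. */
typedef struct SketchScan
{
	uint32_t	flags;			/* scan option bits */
	uint32_t	reset_every;	/* assumed interval, in pages */
	uint64_t	resets;			/* how many times the snapshot was replaced */
} SketchScan;

/* True when the snapshot would be replaced after fetching block blkno. */
bool
sketch_should_reset(const SketchScan *scan, uint32_t blkno)
{
	assert(scan->reset_every > 0);
	return (scan->flags & SKETCH_SO_RESET_SNAPSHOT) != 0 &&
		(blkno % scan->reset_every) == 0;
}

/* Walk blocks 0..nblocks-1 in order and count the snapshot replacements. */
uint64_t
sketch_scan_blocks(SketchScan *scan, uint32_t nblocks)
{
	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (sketch_should_reset(scan, blkno))
			scan->resets++;		/* the patch pops the old snapshot and pushes a fresh one here */
	}
	return scan->resets;
}
```

Under these assumptions block 0 also satisfies the condition, so a scan of 10 blocks with an interval of 4 would replace the snapshot at blocks 0, 4 and 8, and a scan without the flag never replaces it.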

Regression tests are added to verify the behavior.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index aac8c74f546..63a08fbe615 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 75a65ec9c75..08dc35dd8df 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1213,11 +1213,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1230,6 +1231,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1249,6 +1251,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2372,6 +2375,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2397,9 +2401,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2442,6 +2453,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2521,6 +2534,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2537,6 +2552,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad880..f6f40c2f53f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 4c83b09edde..0bc93d86460 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -196,6 +196,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fa7935a0ed3..def4fe20d1e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -570,6 +571,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -611,7 +642,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1256,6 +1292,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e78682c3cef..f17c5dbacaa 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to work with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* store pointer to the active snapshot because it may have been copied */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 07bae342e25..0d262a4188d 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db90..b490da0eeee 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e1741c81f5..0a153c6f746 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3214,7 +3231,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3277,12 +3295,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which causes a new snapshot to be set as active every so often. The reason
+ * for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6a72e566d4a..36b875945d3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1700,23 +1700,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to one snapshot, or to a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4079,9 +4073,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4096,7 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36ee6dd43de..e0d82d17918 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6789,6 +6790,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6844,6 +6846,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6901,6 +6908,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 1640d9c32f7..f5bb04d5bd1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 131c050c15f..5393b30c57e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -936,7 +948,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -944,6 +957,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1776,6 +1798,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index built by a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is set for the scan, so snapshots are replaced on the
+ * fly and the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
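
Reader's note on the patch above: the two conditions it relies on can be modelled in a few lines of standalone C. This is only a toy sketch, not PostgreSQL code. The `toy_` names are hypothetical, and every bit position except `1 << 11` for the reset flag is a placeholder chosen for the example. The first function mirrors how `table_beginscan_strat` composes its flags from the three booleans. The second restates the assertion added to `btbuild`: only a serial, non-unique, concurrent build is expected to finish with no xmin held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy flag bits. Only TOY_SO_RESET_SNAPSHOT (1 << 11) matches the value
 * introduced by the patch; the other positions are placeholders.
 */
enum toy_scan_options
{
	TOY_SO_TYPE_SEQSCAN = 1 << 0,
	TOY_SO_ALLOW_STRAT = 1 << 1,
	TOY_SO_ALLOW_SYNC = 1 << 2,
	TOY_SO_ALLOW_PAGEMODE = 1 << 3,
	TOY_SO_RESET_SNAPSHOT = 1 << 11
};

/* Compose scan flags the way the patched table_beginscan_strat does. */
static uint32_t
toy_beginscan_flags(bool allow_strat, bool allow_sync, bool reset_snapshot)
{
	uint32_t	flags = TOY_SO_TYPE_SEQSCAN | TOY_SO_ALLOW_PAGEMODE;

	if (allow_strat)
		flags |= TOY_SO_ALLOW_STRAT;
	if (allow_sync)
		flags |= TOY_SO_ALLOW_SYNC;
	if (reset_snapshot)
		flags |= TOY_SO_RESET_SNAPSHOT;
	return flags;
}

/*
 * Restatement of the Assert added to btbuild:
 *   parallel || unique || !concurrent || xmin is invalid
 * which means xmin must be unset exactly when the build is concurrent,
 * serial and non-unique.
 */
static bool
toy_xmin_must_be_unset(int parallel_workers, bool unique, bool concurrent)
{
	assert(parallel_workers >= 0);
	return concurrent && parallel_workers == 0 && !unique;
}
```

For example, `toy_xmin_must_be_unset(0, false, true)` is true, while a unique or parallel build returns false. That matches the expected test output, where the unique-index cases print no injection-point notices.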



  [application/octet-stream] v16-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (17.5K, 14-v16-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From c5395363e7061de275eb2ad359bc488e4243f71d Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v16 01/12] This is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 6 files changed, 216 insertions(+), 49 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 32ff3ca9a28..6a72e566d4a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1796,6 +1796,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4201,7 +4202,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4280,6 +4281,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 742f3f8c08d..f2a74b76465 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -943,6 +944,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 5cd5e2eeb80..df2420ce8ab 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * The additional arbiter indexes must be excluded from the comparison.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index b0fe50075ad..d5ad73f6f69 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1158,6 +1159,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 71abb01f655..af7586a428f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * Open all indexes in list order to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of named constraint ensure candidate has equal set
+		 * of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8f1508b1ee2..3d018c3a1e8 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
-- 
2.43.0
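
For readers skimming the infer_arbiter_indexes() changes above, the selection
rule can be sketched outside the planner. This is only an illustration under
stated assumptions: CandidateIndex, its "matches" flag and choose_arbiters()
are hypothetical names, not part of the patch or of PostgreSQL. The sketch
shows that every ready and matching index becomes an arbiter, while planning
still fails unless at least one of them is also valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the pg_index flags used in the patch. */
typedef struct CandidateIndex
{
	unsigned int oid;
	bool		indisready;		/* index already receives inserts */
	bool		indisvalid;		/* index is usable by queries */
	bool		matches;		/* assumed: attributes, expressions and
								 * predicate matched the ON CONFLICT spec */
} CandidateIndex;

/*
 * Collect arbiter OIDs into out[] (capacity must be at least n).
 * Returns the number of arbiters, or -1 where the patch would report that
 * no constraint matches the ON CONFLICT specification.
 */
int
choose_arbiters(const CandidateIndex *cands, size_t n, unsigned int *out)
{
	int			nresults = 0;
	bool		found_valid = false;

	assert(cands != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* Non-ready indexes are skipped; ready but invalid ones are kept. */
		if (!cands[i].indisready || !cands[i].matches)
			continue;

		out[nresults++] = cands[i].oid;
		found_valid |= cands[i].indisvalid;
	}

	/* Mirrors "results == NIL || !foundValid" in the patch. */
	if (nresults == 0 || !found_valid)
		return -1;
	return nresults;
}
```

With one valid index and one ready-but-invalid copy of it (the REINDEX
CONCURRENTLY case), both OIDs are returned; with only the invalid copy, the
function returns -1.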




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
@ 2025-05-18 15:09 ` Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-05-18 15:09 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; Andres Freund <[email protected]>; +Cc: [email protected]; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased version + materials from PGConf.dev 2025 Poster Session :)

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v19-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (28.7K, 2-v19-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From d28e4c7bbc2980c6d43015126ca88bdcdcc05238 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v19 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  19 ++--
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 363 insertions(+), 42 deletions(-)
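
Note for reviewers: the lookup that get_auxiliary_index() performs over
pg_depend can be pictured with a small in-memory model. This is a sketch under
assumptions, not the catalog code: DepRow and find_auxiliary_index() are
hypothetical names, and the relkind is stored in the row here, whereas the
patch calls get_rel_relkind(). The character codes 'a' (DEPENDENCY_AUTO),
'n' (DEPENDENCY_NORMAL) and 'i' (RELKIND_INDEX) are the usual catalog values.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical in-memory model of a pg_depend row; not the catalog struct. */
typedef struct DepRow
{
	Oid			objid;			/* dependent object */
	Oid			refobjid;		/* referenced object (the main index) */
	char		deptype;		/* 'a' = AUTO, 'n' = NORMAL */
	char		relkind;		/* relkind of objid: 'i' = index, 'r' = table */
} DepRow;

/*
 * Return the first index that has an AUTO dependency on indexId,
 * or InvalidOid if there is none.
 */
Oid
find_auxiliary_index(const DepRow *rows, size_t nrows, Oid indexId)
{
	assert(rows != NULL || nrows == 0);

	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].refobjid == indexId &&
			rows[i].deptype == 'a' &&
			rows[i].relkind == 'i')
			return rows[i].objid;
	}
	return InvalidOid;
}
```

Because the match is on dependency type and relkind alone, the model makes the
patch's assumption visible: any index with an AUTO dependency on the main
index is treated as its auxiliary index.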

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e7a7a160742..298a093f554 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 57c347f2930..634ba55d184 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -474,14 +474,17 @@ Indexes:
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
-    index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
-    If the invalid index is instead suffixed <literal>ccold</literal>,
-    it corresponds to the original index which could not be dropped;
-    the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    <literal>ccnew</literal>, then it corresponds to the transient index
+    created during the concurrent operation. The recommended recovery
+    method is to drop it using <literal>DROP INDEX</literal>, then attempt
+    <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>ccaux</literal>) will be automatically dropped
+    along with its main index. If the invalid index is instead suffixed
+    <literal>ccold</literal>, it corresponds to the original index which
+    could not be dropped; the recommended recovery method is to just drop
+    said index, since the rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
    </para>
 
    <para>
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c09c6a2b67..bf0bb79474b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be given both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3842,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3898,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4186,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4275,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4307,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 65fa7fd74e0..354ce8dd463 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1260,7 +1260,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3639,6 +3639,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3988,6 +3989,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3995,6 +3997,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4068,12 +4071,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4083,6 +4091,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4104,10 +4113,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4288,7 +4305,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4311,6 +4329,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4529,6 +4550,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4580,6 +4603,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ad38247aa..a1043c183f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop is required to be the first transaction. But if
+		 * a junk auxiliary index exists, we want to drop it too (also
+		 * concurrently). In this case perform a silent internal deletion of the
+		 * auxiliary index, and restore the transaction state. It is fine to do this
+		 * in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+								PERFORM_DELETION_CONCURRENTLY |
+								PERFORM_DELETION_INTERNAL |
+								PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4713f18e68d..53b2b13efc3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
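Note on the RemoveRelations() hunk above: the list of object addresses is allocated in a private context so that it survives the intermediate commits of a concurrent drop. The sketch below illustrates that lifetime rule in isolation, assuming a toy allocator. DemoContext, demo_alloc(), demo_reset() and demo_commit() are hypothetical names for illustration only and are not PostgreSQL's MemoryContext API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy context: a linked list of chunks freed together. */
typedef struct DemoChunk
{
	struct DemoChunk *next;
} DemoChunk;

typedef struct DemoContext
{
	DemoChunk  *chunks;			/* most recently allocated chunk first */
	size_t		nchunks;		/* number of live chunks */
} DemoContext;

/*
 * Allocate "size" bytes owned by "cxt".  The payload follows the chunk
 * header, so it is only pointer-aligned; enough for this demo.
 */
void *
demo_alloc(DemoContext *cxt, size_t size)
{
	DemoChunk  *chunk = malloc(sizeof(DemoChunk) + size);

	assert(chunk != NULL);
	chunk->next = cxt->chunks;
	cxt->chunks = chunk;
	cxt->nchunks++;
	return chunk + 1;
}

/* Free everything owned by "cxt". */
void
demo_reset(DemoContext *cxt)
{
	while (cxt->chunks != NULL)
	{
		DemoChunk  *next = cxt->chunks->next;

		free(cxt->chunks);
		cxt->chunks = next;
	}
	cxt->nchunks = 0;
}

/*
 * Stand-in for a transaction commit: only transaction-lifetime memory is
 * released.  Memory in any other context is left untouched.
 */
void
demo_commit(DemoContext *txn_cxt)
{
	demo_reset(txn_cxt);
}
```

In the patch itself this role is played by the context created with AllocSetContextCreate(PortalContext, ...) and released with MemoryContextDelete(); the sketch only shows why data that must outlive a commit cannot live in transaction-lifetime memory.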



  [application/octet-stream] v19-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.3K, 3-v19-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 854a2d3d5b7389f41c3a9392ad603f074fe77b33 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v19 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization let concurrent index builds skip waiting for other concurrent builds of indexes that have neither expressions nor predicates. With the new snapshot handling approach, which periodically refreshes snapshots, the optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- dropping PROC_IN_SAFE_IC from the exclusion flags that WaitForOlderSnapshots() passes to GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
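For readers unfamiliar with the flag, here is a minimal sketch of the mask-based filtering that the removed optimization relied on. It assumes a simplified process table: DemoProc and demo_count_waited() are hypothetical, xmin comparison ignores XID wraparound, and only the flag values are taken from the proc.h hunk in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in the proc.h hunk of this patch. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02
#define PROC_IN_SAFE_IC		0x04	/* removed by this patch */

/* Hypothetical stand-in for one process-array entry. */
typedef struct DemoProc
{
	uint8_t		statusFlags;
	uint32_t	xmin;			/* 0 means "no snapshot advertised" */
} DemoProc;

/*
 * Count the processes a waiter would have to wait for: those that
 * advertise an xmin at or before limit_xmin and carry none of the
 * excluded status flags.
 */
size_t
demo_count_waited(const DemoProc *procs, size_t n,
				  uint32_t limit_xmin, uint8_t exclude_flags)
{
	size_t		count = 0;

	assert(procs != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (procs[i].statusFlags & exclude_flags)
			continue;			/* caller asked to ignore this process */
		if (procs[i].xmin == 0)
			continue;			/* holds no snapshot */
		if (procs[i].xmin <= limit_xmin)
			count++;
	}
	return count;
}
```

With PROC_IN_SAFE_IC in the mask, a process running a "safe" concurrent build is skipped; without it, the same process is counted like any other snapshot holder, which is the behavior after this patch.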
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f58e138eed2..2f066f45c62 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3620,7 +3585,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3994,17 +3958,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4070,7 +4023,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4171,11 +4123,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4207,10 +4154,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4219,11 +4162,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4248,10 +4186,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4271,11 +4205,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4297,10 +4226,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4336,10 +4261,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4367,9 +4288,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4391,13 +4309,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4453,12 +4364,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4522,12 +4427,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4795,36 +4694,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v19-0010-Optimize-auxiliary-index-handling.patch (2.4K, 4-v19-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From c21b40416b5a0b668aa7dbd1fc994c77685fb18a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v19 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bf0bb79474b..d1b96703bbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 499cba145dd..c8b51e2725c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v19-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (96.8K, 5-v19-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 031d66f94c0756133d0da0bed3b946ac588c8b03 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v19 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into a tuplesort for each
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
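
For readers who want to see the two-pointer step in isolation, here is a minimal sketch. It is not the patch code: plain sorted int64_t arrays stand in for the two tuplesorts and the tuplestore, and sorted_difference() and its parameter names are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the tuplesort/tuplestore machinery: both inputs
 * are plain arrays of encoded TIDs, already sorted in ascending order.
 *
 * Copies every element of aux_tids[] that does not occur in main_tids[] to
 * out[] (which must have room for naux elements) and returns the number of
 * elements copied.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
                  const int64_t *aux_tids, size_t naux,
                  int64_t *out)
{
    size_t i = 0;       /* cursor into main_tids */
    size_t nout = 0;    /* number of elements written to out */

    assert(out != NULL || naux == 0);

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux_tids[j]. */
        while (i < nmain && main_tids[i] < aux_tids[j])
            i++;

        /* Main is exhausted or has moved past aux_tids[j]: it is missing. */
        if (i == nmain || main_tids[i] > aux_tids[j])
            out[nout++] = aux_tids[j];
    }

    return nout;
}
```

Each input is read once and never rewound, which is why both TID sets have to be sorted before this step.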
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 292 +++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/catalog/toasting.c             |   3 +-
 src/backend/commands/indexcmds.c           | 337 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |  12 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1104 insertions(+), 351 deletions(-)
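
The read-stream step described in the commit message groups the sorted TIDs by heap block and hands each block to the stream with a terminated list of offsets. The sketch below shows only that grouping idea, under simplifying assumptions: the TIDs sit in a plain array (so the code can peek at the next element instead of keeping a "leftover" offset as the patch does), and Tid, GroupState, next_block() and the INVALID_* macros are hypothetical names, not PostgreSQL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK  UINT32_MAX  /* hypothetical stand-in for InvalidBlockNumber */
#define INVALID_OFFSET 0           /* hypothetical stand-in for InvalidOffsetNumber */

typedef struct Tid
{
    uint32_t block;
    uint16_t offset;
} Tid;

typedef struct GroupState
{
    const Tid *tids;    /* sorted by (block, offset) */
    size_t     ntids;
    size_t     pos;     /* next unread element */
} GroupState;

/*
 * Return the next block number and fill offsets[] with the offsets that
 * belong to that block, terminated by INVALID_OFFSET.  Returns INVALID_BLOCK
 * once the input is exhausted.  offsets[] must have room for the largest
 * group plus the terminator.
 */
static uint32_t
next_block(GroupState *st, uint16_t *offsets)
{
    size_t   n = 0;
    uint32_t block;

    assert(st != NULL && offsets != NULL);

    if (st->pos >= st->ntids)
    {
        offsets[0] = INVALID_OFFSET;
        return INVALID_BLOCK;
    }

    /* Collect offsets while the block number stays the same. */
    block = st->tids[st->pos].block;
    while (st->pos < st->ntids && st->tids[st->pos].block == block)
        offsets[n++] = st->tids[st->pos++].offset;

    offsets[n] = INVALID_OFFSET;
    return block;
}
```

Calling next_block() repeatedly yields each distinct block once, in ascending order, so every heap page holding a candidate tuple is read a single time.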

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 147a8f7587c..e7a7a160742 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5b3c769800e..57c347f2930 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>ccnew</literal>, then it corresponds to the transient
+    <literal>ccnew</literal> or <literal>ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..633bc245e28 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation, store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) On the very first invocation (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If no block number has been set yet (only on the very first
+		 * invocation), initialize it using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..6c09c6a2b67 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and needs
+ * to be built later on.  This is called during concurrent index builds.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that we build the auxiliary index.  It is a fast operation without
+ * any actual table scan; as a result, we have an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same moment we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that we need to do the
+ * bulkdelete in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3592,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3617,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the index relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts were already ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3756,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4032,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4281,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4307,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..edd61c294a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1288,16 +1288,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 15206d27227..65fa7fd74e0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1267,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1610,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1648,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1662,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1701,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1713,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. It is created empty, without
+	 * any actual heap scan, but is marked as "ready-for-inserts". Its purpose
+	 * is to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1773,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1812,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1846,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1871,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3638,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3744,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3804,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3873,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3983,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4045,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4063,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4082,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4101,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4189,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because no heap scan is done and the index starts empty. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,6 +4281,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4098,12 +4323,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4141,7 +4360,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4170,7 +4389,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4479,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4511,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4546,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4570,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index acd20dbfab8..6c43f47814d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												Snapshot snapshot,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..4713f18e68d 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3850dde4adb..76f25ec686f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -187,8 +187,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v19-0011-Refresh-snapshot-periodically-during-index-valid.patch (32.1K, 6-v19-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 02dc23e508622b19ef2df3df1de763cd37ddb58b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v19 11/12] Refresh snapshot periodically during index
 validation

Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE (4096) pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
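
A rough, self-contained sketch of the refresh cadence and limitXmin tracking described above. It is not the patch code: Xid, xid_follows, xid_newer, ScanResult and simulate_validate_scan are hypothetical stand-ins, snapshots are reduced to a bare xmin value, and the comparison assumes both xids are normal (the real TransactionIdFollows also special-cases permanent xids).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for TransactionId; 0 plays the role of "invalid". */
typedef uint32_t Xid;
#define INVALID_XID ((Xid) 0)

/* True if a is logically later than b under modulo-2^32 arithmetic. */
static bool
xid_follows(Xid a, Xid b)
{
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two xids; an invalid xid loses to a valid one. */
static Xid
xid_newer(Xid a, Xid b)
{
	if (a == INVALID_XID)
		return b;
	if (b == INVALID_XID)
		return a;
	return xid_follows(a, b) ? a : b;
}

typedef struct ScanResult
{
	Xid			limit_xmin;		/* newest snapshot xmin seen during the scan */
	int			refreshes;		/* number of snapshot replacements */
} ScanResult;

/*
 * Walk npages pages and replace the "snapshot" whenever the page counter
 * is a multiple of refresh_every.  Each replacement is assumed to observe
 * an xmin that moved forward by xmin_step (an assumption of this sketch).
 */
static ScanResult
simulate_validate_scan(uint32_t npages, uint32_t refresh_every,
					   Xid first_xmin, Xid xmin_step)
{
	ScanResult	res;
	Xid			snapshot_xmin = first_xmin;
	uint32_t	page_read_counter = 1;	/* 1, so no refresh right at start */

	assert(refresh_every > 0);
	res.limit_xmin = snapshot_xmin;
	res.refreshes = 0;

	for (uint32_t page = 0; page < npages; page++)
	{
		page_read_counter++;
		/* tuples on this page would be checked against the current snapshot */

		if (page_read_counter % refresh_every == 0)
		{
			snapshot_xmin += xmin_step;	/* pretend the horizon advanced */
			res.refreshes++;
			res.limit_xmin = xid_newer(res.limit_xmin, snapshot_xmin);
		}
	}
	return res;
}
```

For example, simulate_validate_scan(10000, 4096, 100, 5) replaces the snapshot twice (when the counter reaches 4096 and 8192) and reports a limit of 110. The caller would then wait out transactions whose snapshots are older than that limit.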
---
 doc/src/sgml/ref/create_index.sgml            | 11 ++-
 doc/src/sgml/ref/reindex.sgml                 | 11 ++-
 src/backend/access/heap/README.HOT            |  4 +-
 src/backend/access/heap/heapam_handler.c      | 77 ++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c           |  2 +-
 src/backend/access/spgist/spgvacuum.c         | 12 ++-
 src/backend/catalog/index.c                   | 42 +++++++---
 src/backend/commands/indexcmds.c              | 50 ++----------
 src/include/access/tableam.h                  |  7 +-
 src/include/access/transam.h                  | 15 ++++
 src/include/catalog/index.h                   |  2 +-
 .../expected/cic_reset_snapshots.out          | 28 +++++++
 .../sql/cic_reset_snapshots.sql               |  1 +
 13 files changed, 179 insertions(+), 83 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 298a093f554..6220a80474f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Because concurrent index builds use periodically refreshed snapshots and
+   auxiliary indexes, they have minimal impact on concurrent
+   <command>VACUUM</command> operations. The build advances its
+   transaction horizon as it proceeds, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only the snapshot
+   currently in use, which is replaced at regular intervals, can temporarily
+   hold back concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 634ba55d184..b887574f106 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -498,10 +498,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 633bc245e28..4456b16df70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to let the xmin horizon advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked and returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,25 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid", NULL);
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d1b96703bbc..e707b012f41 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3534,8 +3534,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3548,7 +3549,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3569,13 +3570,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3625,8 +3627,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3662,6 +3668,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3671,6 +3680,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3690,19 +3702,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3725,6 +3742,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 354ce8dd463..f58e138eed2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4348,7 +4326,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4363,13 +4340,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4381,16 +4351,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4403,7 +4365,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6c43f47814d..d38a6961035 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 53b2b13efc3..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -154,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
-- 
2.43.0



  [application/octet-stream] v19-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.9K, 7-v19-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From bade8833d582234c10aac67fb86cbb3659580718 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v19 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for the following enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR functions as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
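Note for reviewers: the append-only TID storage described above can be pictured with a small standalone model. This is only an illustrative sketch under stated assumptions, not code from the patch: SketchTid, SketchPage, SKETCH_ITEMS_PER_PAGE, sketch_page_init and sketch_page_add are hypothetical names, and the real StirPageAddItem operates on PostgreSQL buffer pages through the macros in stir.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ItemPointerData (block number + line offset). */
typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/* Tiny capacity chosen for illustration only; not the real page capacity. */
#define SKETCH_ITEMS_PER_PAGE 4

/* Hypothetical page: a counter plus a fixed-size array of TIDs. */
typedef struct SketchPage
{
	uint16_t	maxoff;		/* number of TIDs stored so far */
	SketchTid	items[SKETCH_ITEMS_PER_PAGE];
} SketchPage;

/* Reset a page to the empty state. */
static void
sketch_page_init(SketchPage *page)
{
	memset(page, 0, sizeof(SketchPage));
}

/*
 * Append one TID at the end of the page.  Returns false when the page is
 * full, mirroring the "returns false if tuple doesn't fit" contract.
 */
static bool
sketch_page_add(SketchPage *page, SketchTid tid)
{
	if (page->maxoff >= SKETCH_ITEMS_PER_PAGE)
		return false;
	page->items[page->maxoff] = tid;
	page->maxoff++;
	return true;
}
```

A caller that gets false back would move on to a fresh page, which is the role the metapage's lastBlkNo plays in stirinsert.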
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that the strategy is allowed for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* If inserts are disabled, clean up and exit */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..f05a5eecdda 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5b6cadb5a6c..3850dde4adb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -182,12 +182,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,6 +217,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
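
The `StirPageGetFreeSpace` macro in `src/include/access/stir.h` above subtracts the aligned page header, the aligned opaque area, and the stored tuples from the block size. The arithmetic can be sketched as stand-alone C. Every `SKETCH_*` constant is an assumption about a default build (8 kB blocks, 24-byte page header, 8-byte maximum alignment, a 6-byte `StirTuple` holding one TID, an 8-byte opaque struct); none is taken from the patch, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* All constants below are assumptions for a default build, not patch values. */
#define SKETCH_BLCKSZ			8192	/* assumed block size */
#define SKETCH_PAGE_HEADER		24		/* assumed SizeOfPageHeaderData */
#define SKETCH_MAXIMUM_ALIGNOF	8		/* assumed maximum alignment */
#define SKETCH_STIR_TUPLE		6		/* assumed sizeof(StirTuple): one TID */
#define SKETCH_STIR_OPAQUE		8		/* assumed sizeof(StirPageOpaqueData) */

/* Round len up to the next multiple of the assumed maximum alignment. */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/* Free bytes left on a data page that already holds maxoff tuples. */
static size_t
stir_free_space(size_t maxoff)
{
	size_t		usable = SKETCH_BLCKSZ
		- SKETCH_MAXALIGN(SKETCH_PAGE_HEADER)
		- SKETCH_MAXALIGN(SKETCH_STIR_OPAQUE);

	assert(maxoff * SKETCH_STIR_TUPLE <= usable);
	return usable - maxoff * SKETCH_STIR_TUPLE;
}

/* Number of tuples that fit on an empty data page. */
static size_t
stir_page_capacity(void)
{
	return stir_free_space(0) / SKETCH_STIR_TUPLE;
}
```

Under these assumptions an empty page has 8160 usable bytes, which is room for 1360 heap pointers; a build with a different block size or alignment would give different numbers.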



  [application/octet-stream] v19-0007-Add-Datum-storage-support-to-tuplestore.patch (17.3K, 8-v19-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 9fa8406cb66f0dcff6e16e0fe64fd7b6d099f6bf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v19 07/12] Add Datum storage support to tuplestore

 Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)
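
The on-tape layout described in the commit message can be sketched outside PostgreSQL as follows. This is a minimal illustration, not the patch's code: `Tape`, `put_value` and `get_value` are hypothetical names, an in-memory buffer stands in for the BufFile, and the optional trailing length word used for backward scans is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the BufFile used by tuplestore. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/*
 * typlen > 0: fixed-length type, raw bytes only (len is ignored).
 * typlen < 0: variable-length type, an "unsigned int" length word comes first.
 */
static void
put_value(Tape *t, int typlen, const void *data, unsigned int len)
{
	if (typlen > 0)
		tape_write(t, data, (size_t) typlen);
	else
	{
		tape_write(t, &len, sizeof(len));
		tape_write(t, data, len);
	}
}

/* Copies the next value into dst and returns its payload length. */
static unsigned int
get_value(Tape *t, int typlen, void *dst)
{
	unsigned int len;

	if (typlen > 0)
		len = (unsigned int) typlen;	/* length is known, nothing to read */
	else
		tape_read(t, &len, sizeof(len));
	tape_read(t, dst, len);
	return len;
}

/* Round-trip a 6-byte fixed-length value; returns the bytes used on tape. */
static size_t
demo_fixed(void)
{
	Tape		t = {0};
	unsigned char tid[6] = {0, 0, 0, 42, 0, 7};
	unsigned char out[6];

	put_value(&t, 6, tid, 0);
	assert(get_value(&t, 6, out) == 6 && memcmp(tid, out, 6) == 0);
	return t.wpos;
}

/* Round-trip a variable-length value; returns the bytes used on tape. */
static size_t
demo_varlen(void)
{
	Tape		t = {0};
	const char	msg[] = "hello";	/* 6 bytes including the terminator */
	char		out[16];

	put_value(&t, -1, msg, sizeof(msg));
	assert(get_value(&t, -1, out) == sizeof(msg) && strcmp(out, "hello") == 0);
	return t.wpos;
}
```

The fixed-length case costs exactly the value's size on tape, which is the point of dropping the length header for TIDs; the variable-length case pays one extra `unsigned int` per value.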

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that Datum type */
+	bool		datumTypeByVal; /* is that Datum type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a constant-length Datum, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot but for single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v19-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 9-v19-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 52ab4a6c4d7d944c4ca26b800d504c7bf507ef9f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v19 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring SnapshotSelf-dead tuples during uniqueness checks in tuplesort (a fail-fast mechanism, not a guarantee)
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values (the guarantee of correctness)

Tuples are tested against SnapshotSelf only when index key values are equal; otherwise _bt_load works as before.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples in uniqueness checks during a concurrent build.
+	 * This is required because the snapshot is reset periodically.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no groups
+	 * of multiple alive tuples with the same key values. This may happen if alive
+	 * tuples are separated by a few dead ones, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if it has not been tested yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7d3..d07fe72713d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2532,7 +2530,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3852,7 +3850,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3970,17 +3968,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when any compared column of either tuple is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4006,6 +4011,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4025,7 +4032,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4036,7 +4043,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as comparison function during concurrent
+ * build of unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4045,6 +4053,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when any compared column of either tuple is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4053,7 +4063,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4070,6 +4081,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a93d4f388bc..15206d27227 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a69f71a3ace..acd20dbfab8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots are changed on the fly to allow the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v19-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 10-v19-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From 8eaa54a4919625a8fa69854fef670d4b3258bff8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v19 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying this technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with the scan
- relax the restriction on calling GetLatestSnapshot in parallel workers
---
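Editorial note (not part of the patch): the commit message above relies on two ideas, namely that the xmin horizon can only advance once every participant has replaced its old snapshot, and that workers must have restored their initial snapshot before any reset happens. The following is a minimal, self-contained sketch of that reasoning. It is a toy model under stated assumptions: ToyXid, ToyParticipant and the toy_* functions are hypothetical names invented for illustration, xid wraparound is ignored, and none of this is PostgreSQL code or its API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins, not PostgreSQL types: xids only grow, 0 means "no snapshot". */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID ((ToyXid) 0)

typedef struct ToyParticipant
{
	ToyXid		xmin;				/* oldest xid the current snapshot still needs */
	bool		snapshot_restored;	/* initial snapshot imported from the leader? */
} ToyParticipant;

/* The horizon is the smallest xmin still held; next_xid if nobody holds one. */
static ToyXid
toy_horizon(const ToyParticipant *p, int n, ToyXid next_xid)
{
	ToyXid		horizon = next_xid;

	for (int i = 0; i < n; i++)
		if (p[i].xmin != TOY_INVALID_XID && p[i].xmin < horizon)
			horizon = p[i].xmin;
	return horizon;
}

/* True once every participant has imported its initial snapshot. */
static bool
toy_all_restored(const ToyParticipant *p, int n)
{
	for (int i = 0; i < n; i++)
		if (!p[i].snapshot_restored)
			return false;
	return true;
}

/* Replace the participant's snapshot with a fresh one taken at next_xid. */
static void
toy_reset_snapshot(ToyParticipant *p, ToyXid next_xid)
{
	assert(p->snapshot_restored);	/* the model forbids resetting earlier */
	p->xmin = next_xid;
}
```

In this model a single participant that keeps its old snapshot pins the horizon at the old value, which is why the reset has to happen in every participant, leader and workers alike, before the horizon can move.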
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark
+	 * the scan accordingly and use the active snapshot as
+	 * the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker counts as attached only once it has
+ * restored its snapshot.  This is needed with periodic snapshot resets, so
+ * that all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8df6ba9b89e..a69f71a3ace 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. Snapshots are then changed on the fly, which allows the xmin
+ * horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
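
The handshake this patch adds to parallel.c can be modelled in isolation. The sketch below is a single-process toy, not PostgreSQL code: ToyWorker, toy_init_flags and toy_count_attached are hypothetical names, and the sender_attached field only stands in for the shm_mq_get_sender() check. It shows the rule the leader applies: a worker counts as attached once its error queue has a sender and, if wait_for_snapshot is set, once its flag in the shared array is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for ParallelWorkerInfo; not a PostgreSQL type. */
typedef struct ToyWorker
{
	bool		sender_attached;	/* models shm_mq_get_sender(mq) != NULL */
	bool	   *snapshot_restored;	/* points into the shared flag array */
} ToyWorker;

/* Point each worker at its own slot in the flag array and clear the slot. */
static void
toy_init_flags(ToyWorker *workers, bool *flags, size_t nworkers)
{
	assert(workers != NULL && flags != NULL);

	for (size_t i = 0; i < nworkers; i++)
	{
		workers[i].sender_attached = false;
		workers[i].snapshot_restored = &flags[i];
		*workers[i].snapshot_restored = false;
	}
}

/* Count the workers the leader may treat as attached. */
static size_t
toy_count_attached(const ToyWorker *workers, size_t nworkers,
				   bool wait_for_snapshot)
{
	size_t		n = 0;

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!workers[i].sender_attached)
			continue;			/* error queue has no sender yet */
		if (!wait_for_snapshot || *workers[i].snapshot_restored)
			n++;
	}
	return n;
}
```

In this model a worker "restores its snapshot" by writing true to its own slot, which is what ParallelWorkerMain does in the patch via snapshot_restored_space[ParallelWorkerNumber]. The real leader keeps polling until every launched worker passes the check; the toy only evaluates the check once.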



  [application/octet-stream] v19-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 11-v19-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
From 1d5a4fbd43c023b3010c61453b5846801792e0fc Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v19 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique applies only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
---
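
The reset cadence described in the commit message can be sketched as a standalone toy. This is an illustration under stated assumptions, not the patch's code: toy_count_snapshot_resets is a hypothetical helper, and TOY_RESET_EACH_N_PAGE is an assumed value chosen for the example (the patch defines the real SO_RESET_SNAPSHOT_EACH_N_PAGE). It mirrors the test in heap_fetch_next_buffer, where a reset fires whenever the current block number is a multiple of N.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; the patch defines the real constant. */
#define TOY_RESET_EACH_N_PAGE 4096

/*
 * Count the snapshot resets for a sequential scan over blocks
 * 0 .. nblocks - 1, resetting whenever blk % each_n == 0.
 */
static uint32_t
toy_count_snapshot_resets(uint32_t nblocks, uint32_t each_n)
{
	uint32_t	resets = 0;

	assert(each_n > 0);

	for (uint32_t blk = 0; blk < nblocks; blk++)
	{
		if (blk % each_n == 0)
			resets++;			/* models heap_reset_scan_snapshot() */
	}
	return resets;
}
```

Because the check is on the block number itself, a scan that starts at block 0 resets on its very first block, and a table smaller than N pages sees exactly one reset. For example, toy_count_snapshot_resets(10, 4) counts blocks 0, 4 and 8.
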
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ec8cda1c68..10316246e4d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases it is still not possible because an xid was assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since the push may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which causes a new snapshot to be set as active every so often. The reason
+ * for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0f75debe7f1..a93d4f388bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible under a single snapshot or
+	 * several periodically refreshed ones. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 49ad6e83578..ded9eecfbc0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6901,6 +6902,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6956,6 +6958,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7013,6 +7020,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..8df6ba9b89e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. This changes snapshots
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v19-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 12-v19-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 52d582222e047be124ff5e9a653178eec085f0f7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v19 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v19-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 13-v19-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v19-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (25.3K, 14-v19-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 9993b0c3dc8df7b3a026e7c8f6a43b5ab592a833 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v19 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for stability of stress tests.

---
 contrib/amcheck/meson.build                   |   1 +
 .../t/006_cic_bt_index_parent_check.pl        |  39 +++++
 contrib/amcheck/verify_nbtree.c               |  68 ++++-----
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 +++++++++++++--
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++++++++-----
 src/backend/utils/time/snapmgr.c              |   2 +
 9 files changed, 285 insertions(+), 88 deletions(-)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
       't/003_cic_2pc.pl',
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
+      't/006_cic_bt_index_parent_check.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into index
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d962fe392cd..0f75debe7f1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..499cba145dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 3f8a4cb5244..f1757d02f1c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -487,6 +487,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -697,6 +739,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -707,23 +751,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 46d533b7288..566dbecb390 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1178,6 +1179,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements that other indexes must meet to be used as
+					 * arbiters together with indexOidFromConstraint. Both equivalent indexes
+					 * must be involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE", skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
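
For readers skimming the infer_arbiter_indexes() hunks above, the selection
rule for the conventional-inference case can be sketched in isolation. This
is a simplified illustration, not the patch's code: IndexSketch,
infer_arbiters_sketch() and the bitmask standing in for the attribute
Bitmapset are all hypothetical, and expression and predicate matching are
left out. It only shows that every ready, unique, matching index is kept as
an arbiter, and that planning fails unless at least one of them is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, flattened view of the pg_index fields the patch looks at. */
typedef struct IndexSketch
{
	uint32_t	oid;
	bool		is_ready;		/* stands in for indisready */
	bool		is_valid;		/* stands in for indisvalid */
	bool		is_unique;		/* stands in for indisunique */
	uint64_t	key_attrs;		/* bitmask standing in for the attribute set */
} IndexSketch;

/*
 * Collect arbiter candidates into out[] (caller provides room for n entries).
 * Returns the number of arbiters, or -1 where the real code would raise
 * "there is no unique or exclusion constraint matching ...".
 */
static int
infer_arbiters_sketch(const IndexSketch *indexes, size_t n,
					  uint64_t required_attrs, uint32_t *out)
{
	int			nfound = 0;
	bool		found_valid = false;

	assert(indexes != NULL && out != NULL);

	for (size_t i = 0; i < n; i++)
	{
		const IndexSketch *idx = &indexes[i];

		/* Ready-but-not-yet-valid indexes are considered too. */
		if (!idx->is_ready)
			continue;
		if (!idx->is_unique)
			continue;
		if (idx->key_attrs != required_attrs)
			continue;

		out[nfound++] = idx->oid;
		found_valid |= idx->is_valid;
	}

	/* At least one valid arbiter is required at planning time. */
	if (nfound == 0 || !found_valid)
		return -1;
	return nfound;
}
```

With an old index and its REINDEX CONCURRENTLY replacement both ready, both
end up in the arbiter list, which is what keeps the arbiter set the same for
concurrent transactions planned at slightly different moments.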



  [application/octet-stream] v19-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (28.7K, 15-v19-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)

  [application/octet-stream] v19-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 16-v19-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)

  [application/octet-stream] v19-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (36.9K, 17-v19-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)

  [application/octet-stream] v19-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (17.3K, 18-v19-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)

  [application/octet-stream] v19-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (96.9K, 19-v19-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)

  [application/pdf] STIR-poster.pdf (1.5M, 20-STIR-poster.pdf)


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-05-18 15:56   ` Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Álvaro Herrera @ 2025-05-18 15:56 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello Mihail,

On 2025-May-18, Mihail Nikalayeu wrote:

> Hello, everyone!
> 
> Rebased version + materials from PGConf.dev 2025 Poster Session :)

I agree with Matthias that this work is important, so thank you for
persisting on it.

I didn't understand why you have a few "v19" patches and also a separate
series of "v19-only-part-3-" patches.  Is there duplication?  How do
people know which series comes first?

I think it would be better to get the PDF poster in a wiki page ... in
fact I would suggest to Andrey that he could start a wiki page with all
the PDFs presented at the conference.  Distributing a bunch of 2 MB pdf
via the mailing list doesn't sound too great an idea to me.  A few
people are having trouble with email quotas in cloud services, and the
list server gets bothered because of it.  Kindly don't do that anymore.

Regards

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
@ 2025-05-18 16:09     ` Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-05-18 16:09 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Álvaro!

> I didn't understand why you have a few "v19" patches and also a separate
> series of "v19-only-part-3-" patches.  Is there duplication?  How do
> people know which series comes first?

This was explained in the previous email [0]:

> Patch itself contains 4 parts, some of them may be reviewed/committed
> separately. All commit messages are detailed and contain additional
> explanation of changes.

> To not confuse CFBot, commits are presented in the following way: part
> 1, 2, 3 and 4. If you want only part 3 to test/review – check the
> files with "patch_" extensions. They differ a little bit, but changes
> are minor.

If you have an idea of a better way to handle it, please share. Yes,
the current approach is a bit odd.

> I think it would be better to get the PDF poster in a wiki page ... in
> fact I would suggest to Andrey that he could start a wiki page with all
> the PDFs presented at the conference.  Distributing a bunch of 2 MB pdf
> via the mailing list doesn't sound too great an idea to me.  A few
> people are having trouble with email quotas in cloud services, and the
> list server gets bothered because of it.  Kindly don't do that anymore.

Oh, you're right—I just didn't think of that. My bad, sorry about that.

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-05-23 21:59       ` Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-05-23 21:59 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased. The patch structure and comments are described here [0]; a quick
introduction poster is here [1].

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Attachments:

  [application/octet-stream] v20-only-part-3-0002-Use-auxiliary-indexes-for-concurrent.patch_ (96.9K, 2-v20-only-part-3-0002-Use-auxiliary-indexes-for-concurrent.patch_)

  [application/octet-stream] v20-only-part-3-0005-Refresh-snapshot-periodically-during.patch_ (20.7K, 3-v20-only-part-3-0005-Refresh-snapshot-periodically-during.patch_)

  [application/octet-stream] v20-only-part-3-0004-Optimize-auxiliary-index-handling.patch_ (2.4K, 4-v20-only-part-3-0004-Optimize-auxiliary-index-handling.patch_)

  [application/octet-stream] v20-only-part-3-0003-Track-and-drop-auxiliary-indexes-in-.patch_ (28.3K, 5-v20-only-part-3-0003-Track-and-drop-auxiliary-indexes-in-.patch_)

  [application/octet-stream] v20-only-part-3-0001-Add-Datum-storage-support-to-tuplest.patch_ (17.3K, 6-v20-only-part-3-0001-Add-Datum-storage-support-to-tuplest.patch_)

  [application/octet-stream] v20-0011-Refresh-snapshot-periodically-during-index-valid.patch (32.1K, 7-v20-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From bc8f8fb41c54b03f7298396f24bd2e007e327aa9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v20 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 doc/src/sgml/ref/create_index.sgml            | 11 ++-
 doc/src/sgml/ref/reindex.sgml                 | 11 ++-
 src/backend/access/heap/README.HOT            |  4 +-
 src/backend/access/heap/heapam_handler.c      | 77 ++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c           |  2 +-
 src/backend/access/spgist/spgvacuum.c         | 12 ++-
 src/backend/catalog/index.c                   | 42 +++++++---
 src/backend/commands/indexcmds.c              | 50 ++----------
 src/include/access/tableam.h                  |  7 +-
 src/include/access/transam.h                  | 15 ++++
 src/include/catalog/index.h                   |  2 +-
 .../expected/cic_reset_snapshots.out          | 28 +++++++
 .../sql/cic_reset_snapshots.sql               |  1 +
 13 files changed, 179 insertions(+), 83 deletions(-)
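
The refresh loop described in the commit message can be sketched outside the
backend as follows. This is a minimal, self-contained illustration under
stated assumptions, not the patch's code: xid_t, xid_precedes(), xid_newer(),
take_snapshot_fn, advancing_xmin() and validate_scan_sketch() are
hypothetical stand-ins. The comparison ignores PostgreSQL's special
(non-normal) XIDs, and xid_newer() only assumes that TransactionIdNewer()
returns the later of two XIDs. The interval value and the counter starting
at 1 mirror the hunk in heapam_handler.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t xid_t;			/* stand-in for TransactionId */

/* Same value as VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE in the patch. */
#define RESET_SNAPSHOT_EACH_N_PAGES 4096

/* Wraparound-aware "a precedes b"; special XIDs are ignored in this sketch. */
static bool
xid_precedes(xid_t a, xid_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Assumed semantics: return whichever XID is later. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
	return xid_precedes(a, b) ? b : a;
}

/* Hypothetical callback returning the xmin of a snapshot taken "now". */
typedef xid_t (*take_snapshot_fn) (void *arg);

/* Toy snapshot source: every new snapshot has an xmin 10 XIDs later. */
static xid_t
advancing_xmin(void *arg)
{
	xid_t	   *next = (xid_t *) arg;
	xid_t		current = *next;

	*next += 10;
	return current;
}

/*
 * Scan npages pages, replacing the snapshot every N pages, and return the
 * xmin the caller must later wait out (limitXmin in the patch).
 */
static xid_t
validate_scan_sketch(uint32_t npages, take_snapshot_fn take_snapshot,
					 void *arg, uint32_t *nrefreshes)
{
	xid_t		limit_xmin;
	uint32_t	page_read_counter = 1;	/* 1 skips a reset at the start */

	assert(take_snapshot != NULL && nrefreshes != NULL);

	limit_xmin = take_snapshot(arg);
	*nrefreshes = 0;

	for (uint32_t page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... check this page's tuples against the current snapshot ... */

		if (page_read_counter % RESET_SNAPSHOT_EACH_N_PAGES == 0)
		{
			/* Old snapshot dropped here, so the advertised xmin can move. */
			xid_t		fresh_xmin = take_snapshot(arg);

			limit_xmin = xid_newer(limit_xmin, fresh_xmin);
			(*nrefreshes)++;
		}
	}

	return limit_xmin;
}
```

Between refreshes the process holds no snapshot at all for a moment, which
is what lets the horizon advance; only the newest xmin seen has to be
remembered for the final wait.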

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 298a093f554..6220a80474f 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 633bc245e28..4456b16df70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to advance the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked and returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,25 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#if USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid", NULL);
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d186ce9ec37..8d755470e8c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d1b96703bbc..e707b012f41 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3534,8 +3534,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3548,7 +3549,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3569,13 +3570,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3625,8 +3627,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3662,6 +3668,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 
@@ -3671,6 +3680,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3690,19 +3702,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3725,6 +3742,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 354ce8dd463..f58e138eed2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4348,7 +4326,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4363,13 +4340,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4381,16 +4351,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4403,7 +4365,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6c43f47814d..d38a6961035 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 53b2b13efc3..8fe0acc1e6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -154,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
-- 
2.43.0
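
Note on the TransactionIdNewer() helper that the patch above adds to
transam.h: it treats an invalid XID as "no value" and otherwise picks the
logically later XID. The sketch below is a standalone model of that behavior,
not code from the patch. The names xid_t, xid_is_valid, xid_follows, xid_newer
and xid_newer_self_check are made up for illustration, and the modulo-2^32
comparison is an assumption about how TransactionIdFollows() orders normal
XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone model; these names are illustrative, not PostgreSQL's own. */
typedef uint32_t xid_t;

#define XID_INVALID      ((xid_t) 0)	/* models InvalidTransactionId */
#define XID_FIRST_NORMAL ((xid_t) 3)	/* models FirstNormalTransactionId */

static inline bool
xid_is_valid(xid_t xid)
{
	return xid != XID_INVALID;
}

/* True if a is logically later than b, using modulo-2^32 arithmetic. */
static inline bool
xid_follows(xid_t a, xid_t b)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (a < XID_FIRST_NORMAL || b < XID_FIRST_NORMAL)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs; an invalid XID never wins over a valid one. */
static inline xid_t
xid_newer(xid_t a, xid_t b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/* Small self-check of the cases described above. */
void
xid_newer_self_check(void)
{
	assert(xid_newer(XID_INVALID, 100) == 100);
	assert(xid_newer(100, XID_INVALID) == 100);
	assert(xid_newer(100, 200) == 200);
	/* 5 was assigned after the counter wrapped, so it follows 4294967290. */
	assert(xid_newer(UINT32_C(4294967290), 5) == 5);
}
```

Handling the invalid case first presumably lets a caller fold the xmin of
each fresh snapshot into a running value that starts out as
InvalidTransactionId, which would fit validate_index() now returning the
limit XID instead of taking a snapshot argument.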



  [application/octet-stream] v20-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.3K, 8-v20-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 4058b4ad87706a184fdae7b1c0d6eb43b267ea7f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v20 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed a concurrent index build, while waiting for older snapshots, to ignore other backends running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY on indexes without expressions or predicates. Because concurrent index builds now refresh their snapshots periodically, the optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 423424e51a2..93ad3f3f632 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8d755470e8c..00c86bfcfc6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f58e138eed2..2f066f45c62 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3620,7 +3585,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3994,17 +3958,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4070,7 +4023,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4171,11 +4123,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4207,10 +4154,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4219,11 +4162,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4248,10 +4186,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4271,11 +4205,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4297,10 +4226,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4336,10 +4261,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4367,9 +4288,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4391,13 +4309,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4453,12 +4364,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4522,12 +4427,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4795,36 +4694,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v20-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (28.3K, 9-v20-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 6fbab3cb700228df664a593dd973d90872c788be Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v20 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
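Reader's note, not part of the patch: the tracking described above boils down to "find an index that has an AUTO dependency on the main index". Below is a minimal standalone sketch of that lookup over an in-memory array instead of pg_depend. All Toy* names and the struct layout are hypothetical stand-ins chosen for illustration; the real lookup in this patch is get_auxiliary_index() in pg_depend.c, which scans the catalog with systable_beginscan().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for catalog types; not PostgreSQL definitions. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0u

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_KIND_TABLE, TOY_KIND_INDEX } ToyRelKind;

/* One row of a toy dependency table: "dependent" depends on "referenced". */
typedef struct
{
	ToyOid		dependent;
	ToyRelKind	dependent_kind;
	ToyOid		referenced;
	ToyDepType	deptype;
} ToyDepend;

/*
 * Return the first index that has an AUTO dependency on main_index,
 * or TOY_INVALID_OID if there is none.
 */
ToyOid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, ToyOid main_index)
{
	assert(deps != NULL || ndeps == 0);

	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].referenced == main_index &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].dependent_kind == TOY_KIND_INDEX)
			return deps[i].dependent;
	}
	return TOY_INVALID_OID;
}
```

Like the patch, the sketch returns the first match, so it assumes that at most one index holds an AUTO dependency on a given main index.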
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  64 ++++++++++---
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   2 +-
 src/backend/commands/indexcmds.c           |  35 ++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/include/catalog/dependency.h           |   1 +
 src/include/catalog/index.h                |   1 +
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 12 files changed, 358 insertions(+), 36 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index e7a7a160742..298a093f554 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c09c6a2b67..bf0bb79474b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -688,6 +688,8 @@ UpdateIndexRelation(Oid indexoid,
  *		parent index; otherwise InvalidOid.
  * parentConstraintId: if creating a constraint on a partition, the OID
  *		of the constraint in the parent; otherwise InvalidOid.
+ * auxiliaryForIndexId: if creating auxiliary index, the OID of the main
+ *		index; otherwise InvalidOid.
  * relFileNumber: normally, pass InvalidRelFileNumber to get new storage.
  *		May be nonzero to attach an existing valid build.
  * indexInfo: same info executor uses to insert into the index
@@ -734,6 +736,7 @@ index_create(Relation heapRelation,
 			 Oid indexRelationId,
 			 Oid parentIndexRelid,
 			 Oid parentConstraintId,
+			 Oid auxiliaryForIndexId,
 			 RelFileNumber relFileNumber,
 			 IndexInfo *indexInfo,
 			 const List *indexColNames,
@@ -776,6 +779,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* auxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(auxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1177,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(auxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, auxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1459,6 +1473,7 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  InvalidOid,	/* indexRelationId */
 							  InvalidOid,	/* parentIndexRelid */
 							  InvalidOid,	/* parentConstraintId */
+							  InvalidOid,	/* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -1609,6 +1624,7 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							  InvalidOid,    /* indexRelationId */
 							  InvalidOid,    /* parentIndexRelid */
 							  InvalidOid,    /* parentConstraintId */
+							  mainIndexId,   /* auxiliaryForIndexId */
 							  InvalidRelFileNumber, /* relFileNumber */
 							  newInfo,
 							  indexColNames,
@@ -3842,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3898,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4186,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4275,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4307,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 0ee2fd5e7de..0ee8cbf4ca6 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -319,7 +319,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	coloptions[1] = 0;
 
 	index_create(toast_rel, toast_idxname, toastIndexOid, InvalidOid,
-				 InvalidOid, InvalidOid,
+				 InvalidOid, InvalidOid, InvalidOid,
 				 indexInfo,
 				 list_make2("chunk_id", "chunk_seq"),
 				 BTREE_AM_OID,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 65fa7fd74e0..354ce8dd463 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1260,7 +1260,7 @@ DefineIndex(Oid tableId,
 
 	indexRelationId =
 		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
-					 parentConstraintId,
+					 parentConstraintId, InvalidOid,
 					 stmt->oldNumber, indexInfo, indexColNames,
 					 accessMethodId, tablespaceId,
 					 collationIds, opclassIds, opclassOptions,
@@ -3639,6 +3639,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3988,6 +3989,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -3995,6 +3997,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4068,12 +4071,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4083,6 +4091,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4104,10 +4113,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4288,7 +4305,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4311,6 +4329,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4529,6 +4550,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4580,6 +4603,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 54ad38247aa..a1043c183f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first transaction. But if we
+		 * have a junk auxiliary index, we want to drop it too (also in a
+		 * concurrent way). In that case, perform a silent internal deletion of
+		 * the auxiliary index, then restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4713f18e68d..53b2b13efc3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -73,6 +73,7 @@ extern Oid	index_create(Relation heapRelation,
 						 Oid indexRelationId,
 						 Oid parentIndexRelid,
 						 Oid parentConstraintId,
+						 Oid auxiliaryForIndexId,
 						 RelFileNumber relFileNumber,
 						 IndexInfo *indexInfo,
 						 const List *indexColNames,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v20-0010-Optimize-auxiliary-index-handling.patch (2.4K, 10-v20-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From efd01b195da4b23dc1dc76c44f4f671a8427936b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v20 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index-insert path, detecting auxiliary indexes and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
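A minimal, standalone sketch of the two shortcuts described above. It does not use PostgreSQL headers: DemoIndexInfo, demo_compute_key, demo_form_index_datum, demo_unchanged_by_update and demo_index_unchanged_hint are hypothetical stand-ins for IndexInfo, expression evaluation, FormIndexDatum, index_unchanged_by_update and the hint computation in ExecInsertIndexTuples. The counters exist only to show that the expensive paths are skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the IndexInfo fields touched by the patch. */
typedef struct DemoIndexInfo
{
    int  num_attrs;   /* plays the role of ii_NumIndexAttrs */
    bool auxiliary;   /* plays the role of ii_Auxiliary */
} DemoIndexInfo;

/* Counters that make the skipped work observable (illustration only). */
int demo_expensive_calls = 0;
int demo_unchanged_checks = 0;

/* Stand-in for per-attribute key computation (column fetch or expression). */
long demo_compute_key(int attno, long heap_value)
{
    demo_expensive_calls++;
    return heap_value * 10 + attno;
}

/* Mirrors the early return added to FormIndexDatum: an auxiliary index
 * gets all-NULL values and no key computation happens at all. */
void demo_form_index_datum(const DemoIndexInfo *info, long heap_value,
                           long *values, bool *isnull)
{
    assert(info != NULL && values != NULL && isnull != NULL);

    if (info->auxiliary)
    {
        for (int i = 0; i < info->num_attrs; i++)
        {
            values[i] = 0;
            isnull[i] = true;
        }
        return;
    }

    for (int i = 0; i < info->num_attrs; i++)
    {
        values[i] = demo_compute_key(i, heap_value);
        isnull[i] = false;
    }
}

/* Stand-in for index_unchanged_by_update; always answers "unchanged". */
bool demo_unchanged_by_update(void)
{
    demo_unchanged_checks++;
    return true;
}

/* Mirrors the new condition: short-circuit evaluation means the check
 * is never called for an auxiliary index, so the hint is always false. */
bool demo_index_unchanged_hint(bool update, const DemoIndexInfo *info)
{
    return update && !info->auxiliary && demo_unchanged_by_update();
}
```

The ordering of the && operands matters here: placing the auxiliary test before the call is what avoids the comparison work, not just its result.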
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bf0bb79474b..d1b96703bbc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2932,6 +2932,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 499cba145dd..c8b51e2725c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v20-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 11-v20-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 411431eba585d5502c0c9d16376e41a2258590be Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v20 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now reset snapshots periodically during concurrent unique index builds, while still maintaining uniqueness by:
- ignoring tuples that are dead according to SnapshotSelf during uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values as a guarantee of correctness

Tuples are tested against SnapshotSelf only when index key values are equal; otherwise _bt_load works as before.
---
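A standalone sketch of the correctness check described above, under simplifying assumptions: a single integer key column, and a liveness flag already attached to each sorted entry. In the patch the flag is obtained lazily by fetching the heap tuple with SnapshotSelf, and only for entries whose key equals the previous one. DemoEntry and demo_find_alive_duplicate are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the sorted spool. */
typedef struct DemoEntry
{
    int  key;       /* single key column, for simplicity */
    bool is_null;   /* true if the key is NULL */
    bool alive;     /* assumed result of a SnapshotSelf visibility test */
} DemoEntry;

/*
 * Scan entries sorted by key and return the position of the first entry
 * that is the second alive member of its equal-key group, or -1 if every
 * group has at most one alive member.  Dead entries between two alive
 * ones (the "addda" pattern) do not hide the violation.
 */
long demo_find_alive_duplicate(const DemoEntry *sorted, size_t n,
                               bool nulls_not_distinct)
{
    bool seen_alive = false;

    assert(sorted != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        bool equal_prev = false;

        if (i > 0)
        {
            const DemoEntry *prev = &sorted[i - 1];
            const DemoEntry *cur = &sorted[i];

            if (prev->is_null || cur->is_null)
                /* NULLs only count as equal under NULLS NOT DISTINCT */
                equal_prev = nulls_not_distinct &&
                    prev->is_null && cur->is_null;
            else
                equal_prev = (prev->key == cur->key);
        }

        /* A new key group starts: forget what was seen before. */
        if (!equal_prev)
            seen_alive = false;

        if (sorted[i].alive)
        {
            if (seen_alive)
                return (long) i;    /* uniqueness violation */
            seen_alive = true;
        }
    }
    return -1;
}
```

Where this sketch returns a position, the patch raises ERRCODE_UNIQUE_VIOLATION with the duplicated key in the error detail.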
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 7273b1aee00..0eaa4df5582 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2f45ae96c0c..d186ce9ec37 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * This is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In case of concurrent build dead tuples do not need to be put into the
+	 * index since we wait for all snapshots older than reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if not tested yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7d3..d07fe72713d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2532,7 +2530,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3852,7 +3850,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3970,17 +3968,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true when any compared column is null in either tuple.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4006,6 +4011,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4025,7 +4032,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4036,7 +4043,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4045,6 +4053,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set when a compared column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4053,7 +4063,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4070,6 +4081,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6432ef55cdc..cca1dbb8e37 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a93d4f388bc..15206d27227 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588d3..38471e90a0c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1339,8 +1339,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a69f71a3ace..acd20dbfab8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots are replaced on the fly and the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
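
The duplicate handling added to comparetup_index_btree_tiebreak in the patch above can be summarised with a small standalone model. This is only a sketch under stated assumptions: DemoTuple, its live flag, and demo_unique_violation are hypothetical names that do not exist in PostgreSQL, and the live flag stands in for the table_index_fetch_tuple() probe with SnapshotSelf that the patch performs for each of the two TIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple plus its heap visibility. */
typedef struct
{
	int			key;	/* indexed value */
	bool		live;	/* assumed result of the SnapshotSelf fetch */
} DemoTuple;

/*
 * Returns true when the pair must be reported as a unique violation.
 *
 * With dead_ignored set (modelled on uniqueDeadIgnored), equal keys are
 * tolerated as long as at least one of the two tuples is dead.
 */
bool
demo_unique_violation(const DemoTuple *a, const DemoTuple *b,
					  bool dead_ignored)
{
	assert(a != NULL && b != NULL);

	if (a->key != b->key)
		return false;			/* not duplicates at all */

	if (dead_ignored && (!a->live || !b->live))
		return false;			/* duplicate, but one side is dead */

	return true;				/* two live tuples with the same key */
}
```

With dead_ignored set to false the function degenerates to the old behaviour, where any pair of equal keys is an error.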



  [application/octet-stream] v20-0007-Add-Datum-storage-support-to-tuplestore.patch (17.3K, 12-v20-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From a5b799a999f8cc7dbc934454f0d47dff14c7fda6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v20 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
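
The on-tape layout described in the commit message can be illustrated with a standalone model. It is a minimal sketch, not the patch's implementation: DemoTape and the demo_* functions are hypothetical, the in-memory buffer stands in for a BufFile, and the padding and the trailing length word used for backward scans are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for a BufFile. */
typedef struct
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} DemoTape;

static void
demo_write(DemoTape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
demo_read(DemoTape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/*
 * typlen > 0: fixed-length type, raw bytes only.
 * typlen < 0: variable-length type, "unsigned int" length word first.
 */
void
demo_put(DemoTape *t, int typlen, const void *data, unsigned int datalen)
{
	if (typlen > 0)
		assert(datalen == (unsigned int) typlen);
	else
		demo_write(t, &datalen, sizeof(datalen));
	demo_write(t, data, datalen);
}

/* Reads one value into dst and returns its payload length. */
unsigned int
demo_get(DemoTape *t, int typlen, void *dst)
{
	unsigned int len;

	if (typlen > 0)
		len = (unsigned int) typlen;	/* length known from the type */
	else
		demo_read(t, &len, sizeof(len));	/* length read from the tape */
	demo_read(t, dst, len);
	return len;
}
```

A 6-byte fixed-length value such as a TID therefore costs exactly 6 bytes on the tape, while a 3-byte variable-length value costs sizeof(unsigned int) + 3 bytes.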
 src/backend/utils/sort/tuplestore.c | 270 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 ++--
 2 files changed, 244 insertions(+), 59 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..12ae705c091 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid or type OID of Datums to be stored */
+	int16		datumTypeLen;	/* typlen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape.  Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero).  The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum with constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -776,6 +831,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1030,7 +1104,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			*should_free = true;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1133,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1164,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1226,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1556,25 +1649,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1659,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1718,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.9K, 13-v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From b81c096c3aafefb4591eefbf5e60d378050ec309 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v20 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for the following enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 573 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 777 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
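
Note on the insert and validation paths: the lifecycle in stir.c (append TIDs to the last page, extend when it is full, later walk every data page during validation) can be modelled without buffers or locks. The following is only a single-threaded sketch of that control flow. The model_* names, the tiny page size and the return-value convention are assumptions made for illustration; the real stirinsert always returns false and takes buffer locks that the model omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_TUPLES_PER_PAGE 4		/* tiny on purpose; real pages hold far more */
#define MODEL_MAX_PAGES 16

typedef struct { uint32_t block; uint16_t offset; } ModelTid;

typedef struct
{
	uint16_t	maxoff;				/* number of TIDs stored on the page */
	ModelTid	tuples[MODEL_TUPLES_PER_PAGE];
} ModelPage;

typedef struct
{
	uint16_t	lastBlkNo;			/* 0 = no data page yet; block 0 is the metapage */
	bool		skipInserts;		/* set once the index must no longer be used */
	uint16_t	nblocks;			/* blocks allocated, including the metapage */
	ModelPage	pages[MODEL_MAX_PAGES];
} ModelIndex;

static void
model_init(ModelIndex *idx)
{
	memset(idx, 0, sizeof(*idx));
	idx->nblocks = 1;				/* only the metapage exists */
}

/* Append to a page; false means the page is full. */
static bool
model_page_add(ModelPage *page, ModelTid tid)
{
	if (page->maxoff >= MODEL_TUPLES_PER_PAGE)
		return false;
	page->tuples[page->maxoff++] = tid;
	return true;
}

/* Model convention: true if the TID was stored, false if it was skipped. */
static bool
model_insert(ModelIndex *idx, ModelTid tid)
{
	if (idx->skipInserts)
		return false;
	/* Try the last data page first, as the insert loop does. */
	if (idx->lastBlkNo > 0 && model_page_add(&idx->pages[idx->lastBlkNo], tid))
		return true;
	/* Page full or missing: extend and record the new last block. */
	assert(idx->nblocks < MODEL_MAX_PAGES);
	idx->lastBlkNo = idx->nblocks++;
	return model_page_add(&idx->pages[idx->lastBlkNo], tid);
}

typedef bool (*ModelCallback) (ModelTid tid, void *state);

/* Example callback: counts TIDs and never asks for a deletion. */
static bool
model_count_cb(ModelTid tid, void *state)
{
	(void) tid;
	(*(size_t *) state)++;
	return false;
}

/* Validation-phase walk: visit every TID on every data page (block 1 onward). */
static size_t
model_scan(const ModelIndex *idx, ModelCallback cb, void *state)
{
	size_t		visited = 0;

	for (uint16_t blk = 1; blk < idx->nblocks; blk++)
		for (uint16_t off = 0; off < idx->pages[blk].maxoff; off++)
		{
			bool		dead = cb(idx->pages[blk].tuples[off], state);

			assert(!dead);			/* the patch errors out here: STIR never deletes */
			(void) dead;
			visited++;
		}
	return visited;
}
```

In the model, nine inserts with four TIDs per page leave three data pages, and model_scan() visits all nine TIDs. The part the model leaves out is the re-check of lastBlkNo under the exclusive metapage lock, which is what lets concurrent backends retry instead of extending twice.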

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f28326bad09..232c87ec267 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3092,6 +3092,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3143,6 +3144,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..01f3b660f4b
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,573 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it is an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	START_CRIT_SECTION();
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cca1dbb8e37..e9e22ec0e84 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 2b9d548cdeb..286fcccec3d 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 62beb71da28..f05a5eecdda 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213f..0341bb74325 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -181,12 +181,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -215,6 +216,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v20-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (96.8K, 14-v20-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 41f6ddbb909c6ac2fc408030805f0312d474b709 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v20 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index
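
The two-pointer step above can be illustrated outside PostgreSQL. The following is only a minimal standalone sketch, assuming plain sorted arrays in place of tuplesorts; encode_tid and sorted_difference are hypothetical names used for illustration and are not functions from this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack (block, offset) into one sortable 64-bit key. */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Write to out[] every key in aux[] that is absent from main_keys[].
 * Both inputs must be sorted ascending; out[] needs room for naux keys.
 * Returns the number of keys written.
 */
static size_t
sorted_difference(const int64_t *aux, size_t naux,
				  const int64_t *main_keys, size_t nmain,
				  int64_t *out)
{
	size_t		i;
	size_t		j = 0;
	size_t		n = 0;

	assert(out != NULL || naux == 0);

	for (i = 0; i < naux; i++)
	{
		/* Advance the main cursor until it reaches or passes aux[i]. */
		while (j < nmain && main_keys[j] < aux[i])
			j++;

		/* Main is exhausted or already past aux[i]: the key is missing. */
		if (j == nmain || main_keys[j] > aux[i])
			out[n++] = aux[i];
	}

	return n;
}
```

Each input is read once, so the cost is linear in the combined number of TIDs. The result comes out in TID order, which is what lets the later fetch step visit heap pages in ascending block order.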

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 292 +++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/catalog/toasting.c             |   3 +-
 src/backend/commands/indexcmds.c           | 337 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |  12 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1104 insertions(+), 351 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> waits in the same way before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 147a8f7587c..e7a7a160742 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0eaa4df5582..633bc245e28 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation, store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9e22ec0e84..6c09c6a2b67 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should equal the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -744,7 +749,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -755,11 +761,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +798,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1408,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1463,7 +1474,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1473,6 +1485,155 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into catalogs and needs to
+ * be built later on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3469,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does no
+ * actual table scan, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3493,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but those tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3515,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of the
+ * TID lists to see which tuples from the auxiliary index are missing from the
+ * target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that the bulkdelete calls must
+ * happen in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3540,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% is used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3592,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3617,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the index relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts were ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3756,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4032,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4281,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4307,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 15efb02badb..edd61c294a6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1288,16 +1288,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..0ee2fd5e7de 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 15206d27227..65fa7fd74e0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1251,7 +1267,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1593,6 +1610,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1648,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1662,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1701,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1713,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts". Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1773,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1812,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1846,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1871,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see auxiliary as "non-ready for inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore auxiliary because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3531,6 +3638,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3636,8 +3744,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3689,8 +3804,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3751,6 +3873,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3854,15 +3983,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3913,6 +4045,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3926,12 +4063,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3940,6 +4082,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3958,10 +4101,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4042,13 +4189,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4091,6 +4281,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-to-insert", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4098,12 +4323,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4141,7 +4360,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4170,7 +4389,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4260,14 +4479,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4292,6 +4511,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4305,11 +4546,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4329,6 +4570,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index acd20dbfab8..6c43f47814d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												Snapshot snapshot,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..4713f18e68d 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,6 +103,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +153,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0341bb74325..e02fc6aa3e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -186,8 +186,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v20-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 15-v20-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 7a9042056dec25923c166bee36b72e1b3573c5d7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v20 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to mitigate this by allowing VACUUM to ignore such snapshots. However, it was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch takes a different approach: it periodically resets the snapshot used during the first phase. Resetting the snapshot every N pages of the heap scan allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 01e1db7f856..e5a945a1b14 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9ec8cda1c68..10316246e4d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -612,6 +613,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -653,7 +684,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1304,6 +1340,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ac082fefa77..8a584db595a 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed because the snapshot will be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since pushing may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 3794cc924ad..f3986d086b6 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 739a92bdcc1..cbd0ba9aa01 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to propagate the xmin horizon forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0f75debe7f1..a93d4f388bc 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible in a single snapshot
+	 * or in periodically refreshed ones. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4073,9 +4067,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4090,7 +4081,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ff65867eebe..0d5e54e0cc2 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6899,6 +6900,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6954,6 +6956,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7011,6 +7018,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e48fe434cd3..6caad42ea4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..8df6ba9b89e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest one is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon propagating forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That causes snapshots to be
+ * replaced on the fly, which allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v20-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 16-v20-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From d85b235e8917062dd2d62a008003b89ed035917e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v20 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying that technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with the scan
- relax the restriction on parallel workers calling GetLatestSnapshot
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e5a945a1b14..423424e51a2 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * We then index whatever is live according to the active snapshot, which
+	 * is reset periodically during the scan.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we must
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8a584db595a..7273b1aee00 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that; for a non-unique index in
+	 * a concurrent build, that snapshot may be reset periodically.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f3986d086b6..2f45ae96c0c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * For a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly (SO_RESET_SNAPSHOT) and use the current active snapshot
+	 * as the initial state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker is counted as attached only after it
+ * has restored its snapshot.  This is needed with periodic snapshot resets,
+ * so that all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cbd0ba9aa01..6432ef55cdc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8df6ba9b89e..a69f71a3ace 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a concurrent build of a non-unique index, SO_RESET_SNAPSHOT is applied
+ * to the scan.  The scan then changes snapshots on the fly, which allows the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v20-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 17-v20-0002-Add-stress-tests-for-concurrent-index-builds.patch)
From 62602601260a531754108a9e00eeb863d98b3eac Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v20 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v20-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (25.3K, 18-v20-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
From fc9f12ce38e1c50b21fb48b244da51eba3072536 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v20 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged in single commit. it
 is required for stability of stress tests.

---
 contrib/amcheck/meson.build                   |   1 +
 .../t/006_cic_bt_index_parent_check.pl        |  39 +++++
 contrib/amcheck/verify_nbtree.c               |  68 ++++-----
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 +++++++++++++--
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++++++++-----
 src/backend/utils/time/snapmgr.c              |   2 +
 9 files changed, 285 insertions(+), 88 deletions(-)
 create mode 100644 contrib/amcheck/t/006_cic_bt_index_parent_check.pl

diff --git a/contrib/amcheck/meson.build b/contrib/amcheck/meson.build
index b33e8c9b062..b040000dd55 100644
--- a/contrib/amcheck/meson.build
+++ b/contrib/amcheck/meson.build
@@ -49,6 +49,7 @@ tests += {
       't/003_cic_2pc.pl',
       't/004_verify_nbtree_unique.pl',
       't/005_pitr.pl',
+      't/006_cic_bt_index_parent_check.pl',
     ],
   },
 }
diff --git a/contrib/amcheck/t/006_cic_bt_index_parent_check.pl b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
new file mode 100644
index 00000000000..6e52c5e39ec
--- /dev/null
+++ b/contrib/amcheck/t/006_cic_bt_index_parent_check.pl
@@ -0,0 +1,39 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Test bt_index_parent_check with index created with CREATE INDEX CONCURRENTLY
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('CIC_bt_index_parent_check_test');
+$node->init;
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key)));
+# Insert two rows into the table
+$node->safe_psql('postgres', q(INSERT INTO tbl SELECT i FROM generate_series(1, 2) s(i);));
+
+# start background transaction
+my $in_progress_h = $node->background_psql('postgres');
+$in_progress_h->query_safe(q(BEGIN; SELECT pg_current_xact_id();));
+
+# delete one row from table, while background transaction is in progress
+$node->safe_psql('postgres', q(DELETE FROM tbl WHERE i = 1;));
+# create index concurrently, which will skip the deleted row
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i);));
+
+# check index using bt_index_parent_check
+$result = $node->psql('postgres', q(SELECT bt_index_parent_check('idx', heapallindexed => true)));
+is($result, '0', 'bt_index_parent_check for CIC after removed row');
+
+$in_progress_h->quit;
+done_testing();
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d962fe392cd..0f75debe7f1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4195,7 +4196,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4274,6 +4275,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..499cba145dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 514eae1037d..8851f0fda06 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -486,6 +486,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -696,6 +738,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -706,23 +750,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * However, the additional arbiter indexes must be accounted for too.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 46d533b7288..566dbecb390 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1178,6 +1179,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open the indexes in the loop to avoid a deadlock caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This keeps
+		 * the set of indexes used as arbiters the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
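
The arbiter selection rule described in the patch comments above (accept every matching indisready index, but fail unless at least one of them is also indisvalid) can be pictured with a small standalone sketch. This is only an illustration under simplifying assumptions, not code from the patch: `DemoIndex`, its `matches` flag, and `demo_pick_arbiters` are hypothetical names, and all attribute, expression, and predicate matching is collapsed into a single boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one candidate index. */
typedef struct
{
	unsigned	oid;			/* stand-in for the index OID */
	bool		indisready;		/* index receives inserts */
	bool		indisvalid;		/* index is usable by queries */
	bool		matches;		/* assumed result of all matching checks */
} DemoIndex;

/*
 * Collect the OIDs of all ready, matching indexes into "out" (which must
 * have room for n entries).  Returns the number collected, or -1 when the
 * result is empty or contains no valid index, which corresponds to the
 * error case in the patch.
 */
static int
demo_pick_arbiters(const DemoIndex *indexes, size_t n, unsigned *out)
{
	int			nresults = 0;
	bool		found_valid = false;

	assert(out != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (!indexes[i].indisready || !indexes[i].matches)
			continue;
		out[nresults++] = indexes[i].oid;
		found_valid |= indexes[i].indisvalid;
	}

	if (nresults == 0 || !found_valid)
		return -1;
	return nresults;
}
```

In this sketch a ready-but-not-yet-valid index is returned alongside a valid one, but on its own it is not enough to satisfy the check.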



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Sergey Sargsyan @ 2025-06-16 16:17 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hey Mihail,

I've started looking at the patches today, mostly the STIR part. Seems
solid, but I've got a question about validation. Why are we still grabbing
tids from the main index and sorting them?

I think it's to avoid duplicate errors when adding tuples from STIR to the
main index, but couldn't we just suppress that error during validation and
skip the new tuple insertion if it already exists?

The main index may get huge after building, and iterating over it in a
single thread and then sorting tids can be time consuming.

At least I guess one can skip it when STIR is empty. But, I think we could
skip it altogether by figuring out what to do with duplicates, making
concurrent and non-concurrent index creation almost identical in speed
(only locking and atomicity would differ).
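
For readers following the question, the step under discussion can be pictured as a sorted set difference: TIDs already present in the main index are subtracted from the TIDs recorded by the auxiliary index, and only the remainder needs to be inserted. The sketch below is a standalone illustration that assumes validation works roughly as described in this message; it is not code from the patch, and `DemoTid`, `demo_tid_cmp`, and `demo_find_missing` are hypothetical names. The early return models the shortcut of skipping the work when the auxiliary set is empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Order TIDs by block number, then by offset. */
static int
demo_tid_cmp(const void *a, const void *b)
{
	const DemoTid *x = a;
	const DemoTid *y = b;

	if (x->block != y->block)
		return (x->block < y->block) ? -1 : 1;
	if (x->offset != y->offset)
		return (x->offset < y->offset) ? -1 : 1;
	return 0;
}

/*
 * Copy into "missing" every TID of "aux" that is absent from "main_tids".
 * Both input arrays are sorted in place.  "missing" must have room for
 * naux entries.  Returns the number of TIDs written.
 */
static size_t
demo_find_missing(DemoTid *main_tids, size_t nmain,
				  DemoTid *aux, size_t naux,
				  DemoTid *missing)
{
	size_t		i = 0;
	size_t		n = 0;

	assert(missing != NULL);

	/* Nothing was recorded concurrently: no need to sort or merge. */
	if (naux == 0)
		return 0;

	qsort(main_tids, nmain, sizeof(DemoTid), demo_tid_cmp);
	qsort(aux, naux, sizeof(DemoTid), demo_tid_cmp);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor to the first TID >= aux[j]. */
		while (i < nmain && demo_tid_cmp(&main_tids[i], &aux[j]) < 0)
			i++;
		if (i >= nmain || demo_tid_cmp(&main_tids[i], &aux[j]) != 0)
			missing[n++] = aux[j];
	}
	return n;
}
```

In a real build the expensive parts would be scanning the main index and sorting its TIDs, which is why an empty auxiliary set is an attractive early exit.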

p.s. I noticed that `stir.c` has a lot of functions that don't follow the
Postgres coding style of putting the return type on a separate line.

On Mon, Jun 16, 2025, 6:41 PM Mihail Nikalayeu <[email protected]>
wrote:

> Hello, everyone!
>
> Rebased, patch structure and comments available here [0]. Quick
> introduction poster - here [1].
>
> Best regards,
> Mikhail.
>
> [0]:
> https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
> [1]:
> https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf
>

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-06-16 20:00 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Sergey!

> I think it's to avoid duplicate errors when adding tuples from STIR to the main index,
> but couldn't we just suppress that error during validation and skip the new tuple insertion if it already exists?

In some cases, it is not possible:
– Some index types (GiST, GIN, BRIN) do not provide an easy way to
detect such duplicates.
– When we are building a unique index, we cannot simply skip
duplicates, because doing so would also skip the rows that should
prevent the unique index from being created (unless we add extra logic
for B-tree indexes to compare TIDs as well).
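
The second point can be illustrated with a toy classifier. It is a sketch under simplifying assumptions (integer keys, no visibility checks) rather than anything from the patch or from the B-tree code, and `DemoEntry`, `DemoVerdict`, and `demo_classify` are hypothetical names. It shows that an equal key alone cannot distinguish "this tuple is already indexed" from "another tuple has the same key", whereas comparing the TID as well can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: a key plus the heap TID it points to. */
typedef struct
{
	int64_t		key;
	uint32_t	block;
	uint16_t	offset;
} DemoEntry;

typedef enum
{
	DEMO_NO_CONFLICT,			/* key not present: insert normally */
	DEMO_SAME_TUPLE,			/* same key and same TID: safe to skip */
	DEMO_UNIQUE_VIOLATION		/* same key, different TID: must not skip */
} DemoVerdict;

/* Classify "incoming" against the entries already in the index. */
static DemoVerdict
demo_classify(const DemoEntry *existing, size_t n, const DemoEntry *incoming)
{
	DemoVerdict verdict = DEMO_NO_CONFLICT;

	assert(incoming != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (existing[i].key != incoming->key)
			continue;
		if (existing[i].block == incoming->block &&
			existing[i].offset == incoming->offset)
			verdict = DEMO_SAME_TUPLE;
		else
			return DEMO_UNIQUE_VIOLATION;	/* a real duplicate wins */
	}
	return verdict;
}
```

A key-only check would report both of the last two cases as "duplicate", so silently skipping duplicates would hide genuine uniqueness violations.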

> The main index may get huge after building, and iterating over it in a single thread and then sorting tids can be time consuming.
My tests indicate that the overhead is minor compared with the time
spent scanning the heap and building the index itself.

> At least I guess one can skip it when STIR is empty.
Yes, that’s a good idea; I’ll add it later.

> p.s. I noticed that `stir.c` has a lot of functions that don't follow the Postgres coding style of putting the return type on a separate line.
Hmm... I’ll fix that as well.

Best regards,
Mikhail.





* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Sergey Sargsyan @ 2025-06-16 20:21 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Thank you for the information. Tomorrow, I will also run a few tests to
measure the time required to collect tids from the index; however, since I
do not work with vanilla postgres, the results may vary.

If the results indicate that this procedure is time-consuming, I may
develop an additional patch specifically for b-tree indexes, as they are
the default and most commonly used type.

Best regards,
Sergey




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Sergey Sargsyan @ 2025-06-17 15:55 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello Mihail,

In patch v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch,
within the "StirMarkAsSkipInserts" function, a critical section appears to
be left unclosed. This resulted in an assertion failure during ANALYZE of a
table containing a leftover STIR index.

Best regards,
Sergey



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-06-18 10:49 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Sergey!

> In patch v20-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch, within the "StirMarkAsSkipInserts" function, a critical section appears to be left unclosed. This resulted in an assertion failure during ANALYZE of a table containing a leftover STIR index.
Thanks, good catch. I'll add it to the batch of fixes with the other things.

Best regards,
Mikhail.





* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Sergey Sargsyan @ 2025-06-18 16:33 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hi,

Today I encountered a segmentation fault caused by the patch
v20-0007-Add-Datum-storage-support-to-tuplestore.patch. During the merge
phase, I inserted some tuples into the table so that STIR would have data
for the validation phase. The segfault occurred during a call to
tuplestore_end().

The root cause is that a few functions in the tuplestore code still assume
that all stored data is a pointer and thus attempt to pfree it. This
assumption breaks when datumByVal is used, as the data is stored directly
and not as a pointer. In particular, tuplestore_end(), tuplestore_trim(),
and tuplestore_clear() incorrectly try to free such values.

When addressing this, please also ensure that context memory accounting is
handled properly: we should not increment or decrement the remaining
context memory size when cleaning or trimming datumByVal entries, since no
actual memory was allocated for them.
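
The rule described here can be sketched with a tiny standalone container. This is not the tuplestore code and does not use its API: `DemoDatum`, `DemoStore`, and the `demo_store_*` functions are hypothetical names, and plain `malloc`/`free` stand in for palloc/pfree. The point is only that by-value entries are never freed and never touch the memory counter, while by-reference entries do both.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical Datum stand-in: either a value or a pointer. */
typedef uintptr_t DemoDatum;

typedef struct
{
	DemoDatum  *items;
	size_t		count;
	size_t		capacity;
	bool		by_val;			/* true: items hold values, not pointers */
	size_t		item_len;		/* payload size for by-reference items */
	size_t		mem_used;		/* bytes allocated for item payloads */
} DemoStore;

static void
demo_store_init(DemoStore *s, size_t capacity, bool by_val, size_t item_len)
{
	s->items = calloc(capacity, sizeof(DemoDatum));
	assert(s->items != NULL);
	s->count = 0;
	s->capacity = capacity;
	s->by_val = by_val;
	s->item_len = item_len;
	s->mem_used = 0;
}

/* For by-value stores "d" is the value; otherwise it points at the payload. */
static void
demo_store_put(DemoStore *s, DemoDatum d)
{
	assert(s->count < s->capacity);
	if (s->by_val)
		s->items[s->count++] = d;	/* no allocation, no accounting */
	else
	{
		void	   *copy = malloc(s->item_len);

		assert(copy != NULL);
		memcpy(copy, (const void *) d, s->item_len);
		s->items[s->count++] = (DemoDatum) copy;
		s->mem_used += s->item_len;
	}
}

/* Drop all entries, freeing payloads only when they were allocated. */
static void
demo_store_clear(DemoStore *s)
{
	if (!s->by_val)
	{
		for (size_t i = 0; i < s->count; i++)
		{
			free((void *) s->items[i]);
			s->mem_used -= s->item_len;
		}
	}
	s->count = 0;
}
```

Calling `free` on a by-value entry such as 42 would treat an arbitrary integer as a pointer, which is the kind of failure reported above.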

Interestingly, I’m surprised you haven’t hit this segfault yourself. Are
you perhaps testing on an older system where INT8OID is passed by
reference? Or is your STIR always empty during the validation phase?

One more point: I noticed you modified the index_create() function
signature. You added the relpersistence parameter, which seems
unnecessary—this can be determined internally by checking if it’s an
auxiliary index, in which case the index should be marked as unlogged. You
also added an auxiliaryIndexOfOid argument (I do not remember the exact
name, but it was used for the dependency). It might be cleaner to pass this
via the IndexInfo structure instead. index_create() already has dozens of
unwieldy arguments, and external extensions (like pg_squeeze) still rely on
the old signature, so minimizing changes to the function interface would
improve compatibility.

Best regards,
Sergey


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-06-18 21:15 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Sergey!

> Today I encountered a segmentation fault caused by the patch v20-0007-Add-Datum-storage-support-to-tuplestore.patch. During the merge phase, I inserted some tuples into the table so that STIR would have data for the validation phase. The segfault occurred during a call to tuplestore_end().
>
> The root cause is that a few functions in the tuplestore code still assume that all stored data is a pointer and thus attempt to pfree it. This assumption breaks when datumByVal is used, as the data is stored directly and not as a pointer. In particular, tuplestore_end(), tuplestore_trim(), and tuplestore_clear() incorrectly try to free such values.
>
> When addressing this, please also ensure that context memory accounting is handled properly: we should not increment or decrement the remaining context memory size when cleaning or trimming datumByVal entries, since no actual memory was allocated for them.
>
> Interestingly, I’m surprised you haven’t hit this segfault yourself. Are you perhaps testing on an older system where INT8OID is passed by reference? Or is your STIR always empty during the validation phase?

Thanks for pointing that out. It looks like tuplestore_trim and
tuplestore_clear are broken, while tuplestore_end seems to be correct
but fails due to previous heap corruption.
In my case, tuplestore_trim and tuplestore_clear aren't called at all
- that's why the issue wasn't detected. I'm not sure why; perhaps some
recent changes in your codebase are affecting that?

Please run a stress test (if you've already applied the in-place fix
for the tuplestore):
         ninja && meson test --suite setup && meson test --print-errorlogs --suite pg_amcheck *006*

This will help ensure everything else is working correctly on your system.

> One more point: I noticed you modified the index_create() function signature. You added the relpersistence parameter, which seems unnecessary—
> this can be determined internally by checking if it’s an auxiliary index, in which case the index should be marked as unlogged. You also added an
> auxiliaryIndexOfOid argument (do not remember exact naming, but was used for dependency). It might be cleaner to pass this via the IndexInfo structure
> instead. index_create() already has dozens of mouthful arguments, and external extensions
> (like pg_squeeze) still rely on the old signature, so minimizing changes to the function interface would improve compatibility.

Yes, that’s probably a good idea. I was trying to keep it simple from
the perspective of parameters to avoid dealing with some of the tricky
internal logic.
But you're right - it’s better to stick with the old signature.

Best regards,
Mikhail.





* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Sergey Sargsyan @ 2025-06-18 21:36 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

My bad, my fork's based on pg15, and over there tuplestore_end() does this,

void
tuplestore_end(Tuplestorestate *state)
{
	int			i;

	if (state->myfile)
		BufFileClose(state->myfile);
	if (state->memtuples)
	{
		for (i = state->memtupdeleted; i < state->memtupcount; i++)
			pfree(state->memtuples[i]);
		pfree(state->memtuples);
	}
	pfree(state->readptrs);
	pfree(state);
}

It lets each tuple go one by one, but in pg18, it just nukes the whole
memory context instead.

Therefore, on pg18 the patch presents no issues; however, handling the
datum case in the `_clear` and `_trim` functions is still recommended, to
prevent future developers from hitting segmentation faults when using the
interface. This minor adjustment should ensure the expected behavior.

Best regards,
S

On Thu, Jun 19, 2025, 12:16 AM Mihail Nikalayeu <[email protected]>
wrote:

> Hello, Sergey!
>
> > Today I encountered a segmentation fault caused by the patch
> > v20-0007-Add-Datum-storage-support-to-tuplestore.patch. During the merge
> > phase, I inserted some tuples into the table so that STIR would have data
> > for the validation phase. The segfault occurred during a call to
> > tuplestore_end().
> >
> > The root cause is that a few functions in the tuplestore code still
> > assume that all stored data is a pointer and thus attempt to pfree it. This
> > assumption breaks when datumByVal is used, as the data is stored directly
> > and not as a pointer. In particular, tuplestore_end(), tuplestore_trim(),
> > and tuplestore_clear() incorrectly try to free such values.
> >
> > When addressing this, please also ensure that context memory accounting
> > is handled properly: we should not increment or decrement the remaining
> > context memory size when cleaning or trimming datumByVal entries, since no
> > actual memory was allocated for them.
> >
> > Interestingly, I’m surprised you haven’t hit this segfault yourself. Are
> > you perhaps testing on an older system where INT8OID is passed by
> > reference? Or is your STIR always empty during the validation phase?
>
> Thanks for pointing that out. It looks like tuplestore_trim and
> tuplestore_clear are broken, while tuplestore_end seems to be correct
> but fails due to previous heap corruption.
> In my case, tuplestore_trim and tuplestore_clear aren't called at all
> - that's why the issue wasn't detected. I'm not sure why; perhaps some
> recent changes in your codebase are affecting that?
>
> Please run a stress test (if you've already applied the in-place fix
> for the tuplestore):
>          ninja && meson test --suite setup && meson test --print-errorlogs --suite pg_amcheck *006*
>
> This will help ensure everything else is working correctly on your system.
>
> > One more point: I noticed you modified the index_create() function
> > signature. You added the relpersistence parameter, which seems
> > unnecessary: this can be determined internally by checking if it’s an
> > auxiliary index, in which case the index should be marked as unlogged.
> > You also added an auxiliaryIndexOfOid argument (I do not remember the
> > exact naming, but it was used for the dependency). It might be cleaner
> > to pass this via the IndexInfo structure instead. index_create() already
> > has dozens of unwieldy arguments, and external extensions (like
> > pg_squeeze) still rely on the old signature, so minimizing changes to
> > the function interface would improve compatibility.
>
> Yes, that’s probably a good idea. I was trying to keep it simple from
> the perspective of parameters to avoid dealing with some of the tricky
> internal logic.
> But you're right - it’s better to stick with the old signature.
>
> Best regards,
> Mikhail.
>



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
@ 2025-06-21 20:32                         ` Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-06-21 20:32 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, Sergey!

I have addressed your comments:
* skip TID scan in case of empty STIR index
* fix for critical section
* formatting
* index_create signature


Rebased; the patch structure and comments are available here [0], and a
quick introduction poster is here [1].

Best regards,
Mikhail.

[0]:
https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Attachments:

  [application/x-patch] v21-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.3K, 3-v21-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
From 22e50f2992fb2589d3c4440c13f2e776f2587fd2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v21 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to ignore other concurrent builds of indexes without expressions or predicates. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 947dc79b138..a59f84a4251 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 250d9d59b9a..f80379618b2 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dd280a38c39..0478c9ac7b1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1182,10 +1175,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3654,7 +3619,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4028,17 +3992,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4104,7 +4057,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4205,11 +4157,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4241,10 +4188,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4253,11 +4196,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4282,10 +4220,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4305,11 +4239,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4331,10 +4260,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4370,10 +4295,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4401,9 +4322,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4425,13 +4343,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4487,12 +4398,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4556,12 +4461,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4829,36 +4728,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 19d26408c2a..82acf3006bd 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
+REGRESS = injection_points hashagg cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 8476bfe72a7..bddf22df3ac 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/x-patch] v21-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (95.5K, 4-v21-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
From 4efdbbc5aad10ddb0b28260018785b81b1b7c1b9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v21 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up on any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
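
The two-pointer step of the merge described above can be sketched as
follows. This is an illustration only, not code from the patch: ToyTid,
toy_tid_cmp and toy_missing_tids are hypothetical names, and plain sorted
arrays stand in for the two sorted tuplesorts and the output tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block number, offset number). */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/* Order TIDs by block, then by offset. */
static int
toy_tid_cmp(const ToyTid *a, const ToyTid *b)
{
    if (a->block != b->block)
        return (a->block < b->block) ? -1 : 1;
    if (a->offset != b->offset)
        return (a->offset < b->offset) ? -1 : 1;
    return 0;
}

/*
 * Collect TIDs that are present in aux[] but absent from main_idx[].
 * Both inputs must be sorted ascending; out[] needs room for n_aux entries.
 * Duplicate TIDs in aux[] are reported at most once.  Returns the number
 * of TIDs written to out[].
 */
static size_t
toy_missing_tids(const ToyTid *aux, size_t n_aux,
                 const ToyTid *main_idx, size_t n_main,
                 ToyTid *out)
{
    size_t      i = 0;
    size_t      j = 0;
    size_t      n_out = 0;

    while (i < n_aux)
    {
        int         c;

        /* Main side exhausted: everything left in aux compares as smaller. */
        c = (j == n_main) ? -1 : toy_tid_cmp(&aux[i], &main_idx[j]);

        if (c > 0)
            j++;                /* main entry is behind, advance it */
        else if (c == 0)
            i++;                /* already indexed, skip the aux entry */
        else
        {
            /* Missing from main: emit it unless it repeats the last one. */
            if (n_out == 0 || toy_tid_cmp(&out[n_out - 1], &aux[i]) != 0)
                out[n_out++] = aux[i];
            i++;
        }
    }
    return n_out;
}
```

Each input is read once, so the cost is linear in the number of TIDs after
sorting. In the patch, the resulting TIDs are then fetched from the heap and
only live tuples are inserted into the main index; that part is not modeled
here.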
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 334 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/execnodes.h              |   4 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 18 files changed, 1122 insertions(+), 345 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure use of auxiliary index for new tuples in
+       future transactions.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts them if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58ffa4306e2..f592b09ec68 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation, store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort (main).
+	 * Both sorts return TIDs in ascending order.
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been seen yet (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If no block number has been recorded yet (very first tuple read),
+		 * initialize it from this tuple.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber *tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
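
For readers following the merge logic in heapam_index_validate_tuplesort_difference above, here is a minimal standalone sketch of the same idea: a single pass over two ascending sequences that keeps the entries found only in the auxiliary one. It is not code from the patch. The name sorted_difference and the plain int64_t arrays are hypothetical stand-ins for the encoded-TID tuplesorts and the tuplestore, and the caller is assumed to supply an output buffer with room for naux entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration only: compute aux \ main for two arrays of
 * encoded TIDs that are both sorted in ascending order.  Writes the
 * result to "out" (which must hold at least naux entries) and returns
 * the number of entries written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux_tids, size_t naux,
				  int64_t *out)
{
	size_t		m = 0;			/* cursor into main_tids */
	size_t		n_out = 0;		/* entries written so far */

	assert(out != NULL || naux == 0);

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the main cursor until it reaches or passes aux_tids[a]. */
		while (m < nmain && main_tids[m] < aux_tids[a])
			m++;

		/* Main is exhausted or has overshot: the TID exists only in aux. */
		if (m == nmain || main_tids[m] > aux_tids[a])
			out[n_out++] = aux_tids[a];
	}

	return n_out;
}
```

Because neither cursor ever moves backwards, the cost is linear in nmain + naux, which is presumably why the patch can feed the result straight into a block-ordered read stream without another sort.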
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5bdb577624c..6c8151e538a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create an auxiliary index based on the definition of the index provided
+ * by the caller.  The index is inserted into the catalogs and must be built
+ * later on.  This is called during concurrent index build and reindex.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2632,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2693,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does
+ * not scan the table, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples
+ * are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3495,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3517,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the two TID lists to see which tuples from the auxiliary index are missing
+ * from the target index.  Thus we will ensure that all tuples valid according
+ * to the reference snapshot are in the index.  Note that the bulkdelete
+ * calls must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3542,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Finally, the steps needed to drop the auxiliary index concurrently follow.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/*
+	 * Use 80% of maintenance_work_mem to sort the target index TIDs and 10%
+	 * for the auxiliary index; the remaining 10% is used for a tuplestore.
+	 */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3594,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3619,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		/*
+		 * Nothing to merge, so return early.  validate_index() returns void
+		 * and the caller still holds the reference snapshot (it derives
+		 * limitXmin from snapshot->xmin), so no snapshot handling belongs
+		 * here.
+		 */
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3690,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3722,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3783,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
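The validate_index() header comment above describes sorting the TIDs of the
auxiliary and target indexes and then doing a "merge join" to find TIDs that
are present in the auxiliary index but missing from the target. A minimal
standalone sketch of that idea follows. It is not code from the patch:
encode_tid() and tids_missing_from_main() are hypothetical names, the TID
packing is only an assumption modeled on the "encode TIDs as int8" comment,
and plain arrays stand in for tuplesort objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical encoding: pack (block, offset) into one int64 so that
 * integer order matches TID order.
 */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
    return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Hypothetical helper: both inputs are sorted ascending.  Every distinct
 * entry of aux[] that does not occur in main_tids[] is copied to missing[],
 * which must have room for naux entries.  Returns the number of entries
 * written.
 */
static size_t
tids_missing_from_main(const int64_t *aux, size_t naux,
                       const int64_t *main_tids, size_t nmain,
                       int64_t *missing)
{
    size_t      i;
    size_t      j = 0;
    size_t      nmissing = 0;

    for (i = 0; i < naux; i++)
    {
        /* The merge is only valid on sorted input. */
        assert(i == 0 || aux[i - 1] <= aux[i]);

        /* Skip duplicate TIDs in the auxiliary list. */
        if (i > 0 && aux[i] == aux[i - 1])
            continue;

        /* Advance the target cursor past everything smaller. */
        while (j < nmain && main_tids[j] < aux[i])
            j++;

        /* No match in the target list: this TID still needs an entry. */
        if (j == nmain || main_tids[j] != aux[i])
            missing[nmissing++] = aux[i];
    }

    return nmissing;
}
```

Each list is walked once, so after sorting the cost is linear in the total
number of TIDs. In the patch the TIDs found this way are still checked against
the heap and the reference snapshot before anything is inserted; the sketch
covers only the list comparison.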
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 08f780a2e63..b20decd1204 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1283,16 +1283,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
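The renumbered phases in the view above can be summarized as a lookup table.
This is only an illustrative sketch: createidx_phase_name() is a hypothetical
helper, not something the patch adds, and the strings are copied from the CASE
expression after the change. For phase 3 the view additionally appends an
AM-specific subphase name, which is omitted here.

```c
#include <stddef.h>

/* Phase labels, indexed by the phase number reported in S.param10. */
static const char *const createidx_phase_names[] = {
    "initializing",
    "waiting for writers before build",
    "waiting for writers to use auxiliary index",
    "building index",
    "waiting for writers before validation",
    "index validation: scanning index",
    "index validation: sorting tuples",
    "index validation: merging indexes",
    "waiting for old snapshots",
    "waiting for readers before marking dead",
    "waiting for readers before dropping",
};

/*
 * Hypothetical helper: return the label for a phase number, or NULL for an
 * unknown phase (the SQL CASE likewise yields NULL when nothing matches).
 */
static const char *
createidx_phase_name(long phase)
{
    size_t      nphases = sizeof(createidx_phase_names) /
        sizeof(createidx_phase_names[0]);

    if (phase < 0 || (size_t) phase >= nphases)
        return NULL;
    return createidx_phase_names[phase];
}
```

Monitoring scripts that match on the old numbers need updating, because every
phase from "building index" onward moves up by one.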
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 40154e5b2bb..1b8438d3187 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1593,6 +1609,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1647,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1661,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1700,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1712,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any heap
+	 * scan, but marked as "ready-for-inserts".  Its purpose is to accumulate
+	 * new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions left that still
+	 * see the auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this point we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1772,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1811,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1845,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1870,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see auxiliary as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index, as it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3565,6 +3671,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3670,8 +3777,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3723,8 +3837,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3785,6 +3906,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3888,15 +4016,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3947,6 +4078,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3960,12 +4096,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3974,6 +4115,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3992,10 +4134,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4076,13 +4222,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because there is no heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4125,6 +4314,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-to-insert", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4132,12 +4356,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4175,7 +4393,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4204,7 +4422,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4294,14 +4512,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4326,6 +4544,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4339,11 +4579,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4363,6 +4603,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index acd20dbfab8..6c43f47814d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												Snapshot snapshot,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  snapshot,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0341bb74325..e02fc6aa3e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -186,8 +186,8 @@ typedef struct ExprState
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
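
For readers following the renumbered progress phases, the mapping in the patch above can be sketched as a small standalone lookup. The PROGRESS_CREATEIDX_PHASE_* values and the phase strings are copied from the progress.h and rules.out hunks; the function names toy_createidx_phase_name() and toy_is_wait_phase() are hypothetical and exist only for this illustration. The AM-specific suffix that the real view appends to "building index" is omitted here.

```c
#include <stddef.h>
#include <string.h>

/* Values as renumbered by the patch above (src/include/commands/progress.h). */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/*
 * Hypothetical helper: return the phase text shown by
 * pg_stat_progress_create_index, or NULL for an unknown phase number.
 */
const char *
toy_createidx_phase_name(int phase)
{
	switch (phase)
	{
		case 0:
			return "initializing";
		case PROGRESS_CREATEIDX_PHASE_WAIT_1:
			return "waiting for writers before build";
		case PROGRESS_CREATEIDX_PHASE_WAIT_2:
			return "waiting for writers to use auxiliary index";
		case PROGRESS_CREATEIDX_PHASE_BUILD:
			return "building index";
		case PROGRESS_CREATEIDX_PHASE_WAIT_3:
			return "waiting for writers before validation";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN:
			return "index validation: scanning index";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT:
			return "index validation: sorting tuples";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE:
			return "index validation: merging indexes";
		case PROGRESS_CREATEIDX_PHASE_WAIT_4:
			return "waiting for old snapshots";
		case PROGRESS_CREATEIDX_PHASE_WAIT_5:
			return "waiting for readers before marking dead";
		case PROGRESS_CREATEIDX_PHASE_WAIT_6:
			return "waiting for readers before dropping";
		default:
			return NULL;
	}
}

/* Hypothetical helper: 1 if the phase is one of the wait phases, else 0. */
int
toy_is_wait_phase(int phase)
{
	const char *name = toy_createidx_phase_name(phase);

	return name != NULL && strncmp(name, "waiting", 7) == 0;
}
```

With this numbering, phase 7 is the new "merging indexes" step that takes the place of the old table scan, and there are six wait phases instead of five.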



  [application/x-patch] v21-0011-Refresh-snapshot-periodically-during-index-valid.patch (23.3K, 5-v21-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 462acf9676680d878da365d86405ba59ebe4429d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v21 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
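
The refresh idea described in the commit message can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the code in this patch: ToyXid, ToyCluster, ToySnapshot, toy_take_snapshot() and toy_validate_scan() are all hypothetical names, and "oldest running xid" is reduced to a counter that advances by one per scanned page. It only shows why the xmin held by the scan stays at most a few pages behind, and why the scan returns the xmin of the latest snapshot it used.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;		/* stand-in for TransactionId */

/* Hypothetical: cluster state reduced to the oldest xid still running. */
typedef struct ToyCluster
{
	ToyXid		oldest_running_xid;
} ToyCluster;

/* Hypothetical: a snapshot reduced to the xmin it holds back. */
typedef struct ToySnapshot
{
	ToyXid		xmin;
} ToySnapshot;

static ToySnapshot
toy_take_snapshot(const ToyCluster *cluster)
{
	ToySnapshot snap;

	snap.xmin = cluster->oldest_running_xid;
	return snap;
}

/*
 * Scan npages pages, replacing the snapshot after every refresh_every
 * pages.  Returns the xmin of the latest snapshot used, which is what a
 * caller would wait on afterwards; *nrefreshes counts the replacements.
 */
ToyXid
toy_validate_scan(ToyCluster *cluster, int npages, int refresh_every,
				  int *nrefreshes)
{
	ToySnapshot snap = toy_take_snapshot(cluster);
	int			page;

	assert(refresh_every > 0);
	*nrefreshes = 0;

	for (page = 0; page < npages; page++)
	{
		/* ... tuples on this page would be checked against snap here ... */

		/* Simulate concurrent transactions finishing while we scan. */
		cluster->oldest_running_xid++;

		/* Drop the old snapshot and take a fresh one every few pages. */
		if ((page + 1) % refresh_every == 0)
		{
			snap = toy_take_snapshot(cluster);
			(*nrefreshes)++;
		}
	}

	return snap.xmin;
}
```

In this model, a single reference snapshot corresponds to a refresh_every larger than npages: the returned xmin then equals the value at scan start, however long the scan runs.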
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             |  7 +--
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 11 files changed, 146 insertions(+), 83 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index cf14f474946..1626cee7a03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+   </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f592b09ec68..236e216d170 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 08a3cb28348..250d9d59b9a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than the latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b37684309e3..e20d8a60357 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3535,8 +3535,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3549,7 +3550,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3570,13 +3571,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3626,8 +3628,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3663,6 +3669,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3697,6 +3706,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3716,19 +3728,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3751,6 +3768,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7a42f2d815a..dd280a38c39 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4382,7 +4360,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4397,13 +4374,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4415,16 +4385,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4437,7 +4399,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6c43f47814d..d38a6961035 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0



  [application/x-patch] v21-0010-Optimize-auxiliary-index-handling.patch (2.4K, 6-v21-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 7d3889a10b789dbda99e52cc3c9ffc53886a4de4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v21 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
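
Reviewer-style note (not part of the patch): the execIndexing.c change relies
on C's left-to-right short-circuit evaluation of &&, so the comparatively
expensive unchanged-index check is never called for an auxiliary index. A
minimal standalone sketch of that shape follows. expensive_unchanged_check,
compute_index_unchanged and the call counter are hypothetical stand-ins, not
PostgreSQL functions.

```c
#include <stdbool.h>

/* Counts invocations of the stand-in check, for demonstration only. */
static int	expensive_calls = 0;

/* Hypothetical stand-in for index_unchanged_by_update(). */
static bool
expensive_unchanged_check(void)
{
	expensive_calls++;
	return true;
}

/*
 * Same shape as the patched expression: the expensive check is reached only
 * for an UPDATE on a non-auxiliary index.
 */
static bool
compute_index_unchanged(bool update, bool is_auxiliary)
{
	return update && !is_auxiliary && expensive_unchanged_check();
}
```

Placing the cheap flag test before the function call is what makes the
result false for auxiliary indexes without any extra cost.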

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 97e5d2d68aa..b37684309e3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2933,6 +2933,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 499cba145dd..c8b51e2725c 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/x-patch] v21-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 7-v21-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From ba73159f3f8e91fd7ea4413ae9e5758e002ebeae Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v21 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>, the
+    recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 18316a3968b..ab4c3e2fb4a 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c8151e538a..97e5d2d68aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2633,7 +2646,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2694,7 +2708,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3869,6 +3884,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3941,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; if present, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4242,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4332,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4381,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 1b8438d3187..7a42f2d815a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -246,7 +246,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -944,7 +944,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3672,6 +3673,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4021,6 +4023,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4028,6 +4031,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4101,12 +4105,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for an auxiliary index of the reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4116,6 +4125,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4137,10 +4147,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4321,7 +4339,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start process of dropping auxiliary indexes - including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4344,6 +4363,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4562,6 +4584,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4613,6 +4637,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ea96947d813..79408dd01eb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first transaction. But if we
+		 * have a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. It is fine
+		 * to do it in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e02fc6aa3e6..d037d015639 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -217,6 +217,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
 	bool		ii_Auxiliary;
+	Oid			ii_AuxiliaryForIndexId; /* if creating an auxiliary index,
+										   the OID of the main index; otherwise InvalidOid. */
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
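
As a reading aid for the get_auxiliary_index() lookup in the patch above, here is a minimal standalone sketch of the same filtering rule: among the dependency records that reference a given index, take the first one whose deptype is AUTO and whose dependent object is itself an index. Everything prefixed with Toy/toy_ is hypothetical and stands in for the real catalog scan; the classid check from the patch is omitted for brevity, and obj_is_index replaces the get_rel_relkind() call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Oid / InvalidOid; not PostgreSQL definitions. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0

typedef enum
{
	TOY_DEP_NORMAL,
	TOY_DEP_AUTO,
	TOY_DEP_INTERNAL
} ToyDepType;

/* A simplified dependency record (loosely modeled on a pg_depend row). */
typedef struct ToyDepend
{
	ToyOid		objid;			/* dependent object */
	ToyOid		refobjid;		/* referenced object */
	ToyDepType	deptype;		/* kind of dependency */
	int			obj_is_index;	/* stands in for relkind == RELKIND_INDEX */
} ToyDepend;

/*
 * Return the first object that has an AUTO dependency on indexId and is
 * itself an index, or TOY_INVALID_OID if there is none.
 */
static ToyOid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, ToyOid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].obj_is_index)
			return deps[i].objid;
	}

	return TOY_INVALID_OID;
}
```

Note that the rule relies on the assumption stated in the patch comment: no other index is expected to hold an AUTO dependency on the main index.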



  [application/x-patch] v21-0007-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 8-v21-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 1188eea06eb54c4a11fe203a3aa824b210b45a66 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v21 07/12] Add Datum storage support to tuplestore

 Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
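
Note for readers: the layout rule from the commit message can be illustrated with a small standalone sketch. It is not the code from this patch; ToyTape and the toy_* functions are hypothetical, padding is left out, and the length-word convention (an "unsigned int" holding the total on-tape size including itself) is borrowed from the existing NOTES in tuplestore.c and only assumed to carry over to variable-length Datums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy "tape": a flat byte buffer with a write cursor. */
typedef struct ToyTape
{
	unsigned char buf[256];
	size_t		used;
} ToyTape;

/* Fixed-length value: raw bytes only, no length word. */
static void
toy_write_fixed(ToyTape *tape, const void *data, size_t typlen)
{
	assert(tape->used + typlen <= sizeof(tape->buf));
	memcpy(tape->buf + tape->used, data, typlen);
	tape->used += typlen;
}

/*
 * Variable-length value: a leading "unsigned int" holding the total on-tape
 * size (including the length word itself), followed by the payload.
 */
static void
toy_write_varlen(ToyTape *tape, const void *data, unsigned int datalen)
{
	unsigned int total = (unsigned int) sizeof(unsigned int) + datalen;

	assert(tape->used + total <= sizeof(tape->buf));
	memcpy(tape->buf + tape->used, &total, sizeof(total));
	memcpy(tape->buf + tape->used + sizeof(total), data, datalen);
	tape->used += total;
}

/* Read the length word stored at byte offset 'pos'. */
static unsigned int
toy_read_len(const ToyTape *tape, size_t pos)
{
	unsigned int total;

	memcpy(&total, tape->buf + pos, sizeof(total));
	return total;
}
```

With fixed-length values the reader already knows how many bytes to consume, so the length word would be pure overhead; for TIDs that saves one "unsigned int" per stored entry.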
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid or type OID of Datums to be stored */
+	int16		datumTypeLen;	/* typlen of that Datum type */
+	bool		datumTypeByVal; /* typbyval of that Datum type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple from tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum with constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttuple(), but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
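
The "Routines specialized for Datum case" comment in the patch above describes the spill-file layout only in prose. What follows is a minimal, standalone sketch of that layout, not the patch's code: SketchFile and the sketch_* functions are hypothetical stand-ins for BufFile and the writetup/readtup callbacks, and it assumes an unsigned int length word on both the write and the read side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a BufFile; not a PostgreSQL type. */
typedef struct SketchFile
{
	unsigned char data[256];
	size_t		used;			/* bytes written so far */
	size_t		pos;			/* read cursor */
} SketchFile;

static void
sketch_write(SketchFile *f, const void *src, size_t n)
{
	assert(f->used + n <= sizeof(f->data));
	memcpy(f->data + f->used, src, n);
	f->used += n;
}

/* Returns n on success, 0 if fewer than n bytes remain (treated as EOF). */
static size_t
sketch_read(SketchFile *f, void *dst, size_t n)
{
	if (f->pos + n > f->used)
		return 0;
	memcpy(dst, f->data + f->pos, n);
	f->pos += n;
	return n;
}

/*
 * typlen > 0: fixed-length value, raw bytes only, no length word.
 * typlen < 0: variable-length value, length word first, then the payload,
 *             then a trailing length word if backward scans are enabled.
 */
static void
sketch_put(SketchFile *f, int typlen, int backward,
		   const void *val, unsigned int len)
{
	if (typlen > 0)
	{
		sketch_write(f, val, (size_t) typlen);
		return;
	}
	sketch_write(f, &len, sizeof(len));
	sketch_write(f, val, len);
	if (backward)
		sketch_write(f, &len, sizeof(len));
}

/* Returns the payload size copied into dst, or 0 at end of data. */
static unsigned int
sketch_get(SketchFile *f, int typlen, int backward,
		   void *dst, size_t dstsize)
{
	unsigned int len;
	unsigned int trailer;

	if (typlen > 0)
		len = (unsigned int) typlen;
	else if (sketch_read(f, &len, sizeof(len)) == 0)
		return 0;

	if (len > dstsize || sketch_read(f, dst, len) == 0)
		return 0;

	if (typlen < 0 && backward)
	{
		if (sketch_read(f, &trailer, sizeof(trailer)) == 0)
			return 0;
		assert(trailer == len);
	}
	return len;
}
```

Note that the length word must be written with the same width it is read back with, and only after the real payload size is known. The sketch also checks for end of data before reading a fixed-length value, because there is no length word to signal EOF in that case.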



  [application/x-patch] v21-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.4K, 9-v21-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From fa8ca2e0ec754519a4a95769dfff3b34b07b5a43 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v21 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for the upcoming enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full feature set of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
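Note for readers: the commit message describes STIR as an append-only store of TIDs. Below is a minimal, standalone sketch of that idea under simplifying assumptions. There is no buffer manager and no locking, pages are tiny, and all Sketch* names are hypothetical rather than taken from stir.c. Slot 0 plays the role of the metapage, inserts append to the last data page or extend by one page, and a visitor walks every stored TID the way the validation pass does through the bulk-delete callback.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_ITEMS_PER_PAGE 4	/* tiny on purpose */
#define SKETCH_MAX_PAGES 8
#define SKETCH_MAX_TIDS 32

/* Hypothetical heap pointer: block number plus offset within the block. */
typedef struct SketchTid
{
	unsigned int block;
	unsigned short offset;
} SketchTid;

typedef struct SketchPage
{
	unsigned short maxoff;		/* number of TIDs stored on this page */
	SketchTid	items[SKETCH_ITEMS_PER_PAGE];
} SketchPage;

typedef struct SketchIndex
{
	bool		skip_inserts;	/* metapage flag: reject new inserts */
	unsigned int last_page;		/* 0 means no data page exists yet */
	SketchPage	pages[SKETCH_MAX_PAGES];	/* slot 0 is reserved (metapage) */
} SketchIndex;

/* Append a TID at the end of the page; false if the page is full. */
static bool
sketch_page_add(SketchPage *page, const SketchTid *tid)
{
	if (page->maxoff >= SKETCH_ITEMS_PER_PAGE)
		return false;
	page->items[page->maxoff++] = *tid;
	return true;
}

/* Returns true if the TID was stored, false if it was skipped. */
static bool
sketch_insert(SketchIndex *idx, const SketchTid *tid)
{
	if (idx->skip_inserts)
		return false;

	/* Try the current last page first. */
	if (idx->last_page > 0 &&
		sketch_page_add(&idx->pages[idx->last_page], tid))
		return true;

	/* Otherwise extend by one page and record it as the new last page. */
	if (idx->last_page + 1 >= SKETCH_MAX_PAGES)
		return false;
	idx->last_page++;
	idx->pages[idx->last_page].maxoff = 0;
	return sketch_page_add(&idx->pages[idx->last_page], tid);
}

/* Visit every stored TID in insertion order; returns how many were seen. */
static unsigned int
sketch_visit_all(const SketchIndex *idx,
				 void (*callback) (const SketchTid *tid, void *arg),
				 void *arg)
{
	unsigned int seen = 0;

	for (unsigned int p = 1; p <= idx->last_page; p++)
		for (unsigned short i = 0; i < idx->pages[p].maxoff; i++)
		{
			callback(&idx->pages[p].items[i], arg);
			seen++;
		}
	return seen;
}

/* Example callback: collect the TIDs into a flat array. */
typedef struct SketchCollector
{
	SketchTid	tids[SKETCH_MAX_TIDS];
	unsigned int count;
} SketchCollector;

static void
sketch_collect(const SketchTid *tid, void *arg)
{
	SketchCollector *c = (SketchCollector *) arg;

	if (c->count < SKETCH_MAX_TIDS)
		c->tids[c->count++] = *tid;
}
```

Unlike this sketch, the real stirinsert rechecks the metapage under an exclusive lock before extending, since another backend may have added a page in the meantime.
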
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 09416450af9..893aed0b0d9 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3098,6 +3098,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3149,6 +3150,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9a423425aec..5bdb577624c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..38602e6a72d 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d3d28a263fa..198795f010f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2492282213f..0341bb74325 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -181,12 +181,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -215,6 +216,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index cf48ae6d0c2..52dde57680d 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/x-patch] v21-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 10-v21-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 3fe60e8e1f5227ae58e361a0a97fc6ed603bd8f1 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v21 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring SnapshotSelf-dead tuples during uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values; this is what guarantees correctness

Tuples are tested against SnapshotSelf only when index key values are equal; otherwise _bt_load works as before.
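
The _bt_load check described above can be illustrated with a minimal standalone sketch. It is not the patch code: SketchTuple and sketch_has_alive_duplicate are hypothetical names, the "alive" flag stands in for the result of a SnapshotSelf visibility test against the heap, integer keys stand in for index key comparison, and NULL handling (nulls_not_distinct) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sorted index tuple. */
typedef struct SketchTuple
{
	int			key;		/* index key value */
	bool		alive;		/* what a SnapshotSelf-style test would report */
} SketchTuple;

/*
 * Walk tuples sorted by key and report whether any group of equal keys
 * contains more than one alive tuple.  Liveness is consulted only inside
 * groups of equal keys; tuples with a unique key are never tested.
 * If liveness_checks is not NULL, it receives the number of liveness
 * lookups that were performed.
 */
bool
sketch_has_alive_duplicate(const SketchTuple *tuples, size_t ntuples,
						   size_t *liveness_checks)
{
	bool		seen_alive = false;		/* alive tuple seen in current group? */
	bool		prev_tested = false;	/* previous tuple already tested? */
	bool		found = false;
	size_t		checks = 0;

	for (size_t i = 1; i < ntuples && !found; i++)
	{
		bool		now_alive;

		if (tuples[i].key != tuples[i - 1].key)
		{
			/* a new group starts: forget everything about the old one */
			seen_alive = false;
			prev_tested = false;
			continue;
		}

		/* test the previous tuple lazily, only once a duplicate key shows up */
		if (!prev_tested)
		{
			seen_alive = tuples[i - 1].alive;
			prev_tested = true;
			checks++;
		}

		now_alive = tuples[i].alive;
		checks++;

		/* two alive tuples with equal keys: a real uniqueness violation */
		if (seen_alive && now_alive)
			found = true;

		seen_alive |= now_alive;
	}

	if (liveness_checks != NULL)
		*liveness_checks = checks;
	return found;
}
```

For the "addda" layout (alive, dead, dead, dead, alive with equal keys) the sketch reports a violation even though no two alive tuples are adjacent, which is the case the tuplesort-level check alone may miss.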
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4cbbf7f2d70..58ffa4306e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 052ebfe6a21..08a3cb28348 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in a concurrent build.
+	 * This is required because of the periodic snapshot resets.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* test the previous tuple if not done yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c71d1b6f2e1..75909ada0dd 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -66,8 +66,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2532,7 +2530,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3853,7 +3851,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3971,17 +3969,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so it can be used as a comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls, if given, is set when a compared column is null in either tuple.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4007,6 +4012,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4026,7 +4033,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4037,7 +4044,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4046,6 +4054,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls, if given, is set when a compared column is null in either tuple.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4054,7 +4064,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4071,6 +4082,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 775b995757e..9a423425aec 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0db23b981db..40154e5b2bb 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e709d2e0afe..4bd8a403cbb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1340,8 +1340,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a69f71a3ace..acd20dbfab8 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * Snapshots are then changed on the fly so the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
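
The duplicate check that v16-0009 adds to _bt_load can be pictured with a
small stand-alone sketch. It is not the patch code: ToyTuple and
find_live_duplicate are hypothetical names, the "alive" flag stands in for the
result of a SnapshotSelf heap fetch, and NULL handling and the lazy fetching of
the previous tuple are left out. It only shows the rule that a group of equal
keys is an error when it contains more than one live tuple.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple: a key plus heap liveness. */
typedef struct
{
	int			key;
	bool		alive;		/* what a SnapshotSelf heap fetch would report */
} ToyTuple;

/*
 * Scan tuples sorted by key.  Return the position of the second live tuple
 * found in a group of equal keys, or -1 if every group has at most one live
 * tuple.  Dead duplicates are ignored.
 */
static int
find_live_duplicate(const ToyTuple *tuples, size_t n)
{
	bool		seen_alive = false;

	assert(tuples != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		/* a key change starts a new group, so forget earlier live tuples */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return (int) i; /* two live tuples share one key */
			seen_alive = true;
		}
	}
	return -1;
}
```

With the input (1,dead) (1,live) (2,live) (3,dead) (3,dead) the function
returns -1, because no key has two live tuples. With (1,live) (2,dead)
(2,live) (2,live) it returns 3, the position of the second live tuple for
key 2, which is the case where the patch reports ERRCODE_UNIQUE_VIOLATION.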



  [application/x-patch] v21-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 11-v21-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 1fe7450f31843c3552423ee401c39d131e1c7de7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v21 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/x-patch] v21-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 12-v21-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 69718cf250606ceeecd81b88bc52ded945ea1900 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v21 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for the stability of the stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f2898fee5fc..e065804cf21 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4229,7 +4230,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4308,6 +4309,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index bdf862b2406..499cba145dd 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 514eae1037d..8851f0fda06 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -486,6 +486,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -696,6 +738,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -706,23 +750,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of unequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * However, the additional arbiter indexes are excluded from the count.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 54da8e7995b..86c64477eae 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * Indexes are opened in loop order to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0

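The arbiter-compatibility rule that the patch above adds in IsIndexCompatibleAsArbiter() can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ToyIndex`, `same_text()` and `toy_index_compatible_as_arbiter()` are hypothetical names that do not exist in PostgreSQL, and index expressions and predicates are reduced to plain strings compared for equality, whereas the patch compares node lists with list_difference().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_KEYS 8

/* Hypothetical, simplified stand-in for the catalog data the patch compares. */
typedef struct ToyIndex
{
	bool		is_unique;			/* models ii_Unique */
	bool		has_exclusion_ops;	/* models ii_ExclusionOps != NULL */
	int			nkeyatts;			/* models indnkeyatts */
	int			keys[TOY_MAX_KEYS]; /* models indkey.values[] */
	const char *exprs;				/* index expressions as text, or NULL */
	const char *pred;				/* partial-index predicate as text, or NULL */
} ToyIndex;

/* Two optional strings match if both are NULL or both have equal content. */
static bool
same_text(const char *a, const char *b)
{
	if (a == NULL || b == NULL)
		return a == b;
	return strcmp(a, b) == 0;
}

/*
 * Returns true if "candidate" could serve as an arbiter alongside "arbiter":
 * same uniqueness, no exclusion operators, same key columns in the same
 * order, and (in this simplified model) identical expressions and predicate.
 */
static bool
toy_index_compatible_as_arbiter(const ToyIndex *arbiter,
								const ToyIndex *candidate)
{
	assert(arbiter != NULL && candidate != NULL);

	if (arbiter->is_unique != candidate->is_unique)
		return false;
	if (arbiter->has_exclusion_ops || candidate->has_exclusion_ops)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keys[i] != candidate->keys[i])
			return false;
	}

	if (!same_text(arbiter->exprs, candidate->exprs))
		return false;
	return same_text(arbiter->pred, candidate->pred);
}
```

In this model, the transient copy of an index that REINDEX CONCURRENTLY builds would be described by an identical ToyIndex and so would be accepted as an additional arbiter, while an index on different columns or with a different predicate would be rejected.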


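The next attachment (v21-0003) describes resetting the heap-scan snapshot every N pages so that the xmin horizon can advance during a concurrent index build. The toy model below sketches only that idea. Everything in it is an assumption made for illustration: `ToyScan`, `toy_scan_page()`, `toy_scan_all()` and the value of `TOY_RESET_EACH_N_PAGES` are hypothetical, the "oldest running xid" is assumed to advance by one per page, and no real snapshot machinery is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interval; the patch names its constant SO_RESET_SNAPSHOT_EACH_N_PAGE. */
#define TOY_RESET_EACH_N_PAGES 4

typedef struct ToyScan
{
	uint32_t	snapshot_xmin;	/* xmin held back by the scan's current snapshot */
	uint32_t	resets;			/* number of times the snapshot was replaced */
} ToyScan;

/*
 * Visit one page.  Every TOY_RESET_EACH_N_PAGES pages the old snapshot is
 * dropped and a fresh one is taken, so the xmin held by the scan jumps
 * forward to the oldest xid still running at that moment.
 */
static void
toy_scan_page(ToyScan *scan, uint32_t page_no, uint32_t oldest_running_xid)
{
	assert(scan != NULL);

	if (page_no > 0 && page_no % TOY_RESET_EACH_N_PAGES == 0)
	{
		scan->snapshot_xmin = oldest_running_xid;
		scan->resets++;
	}

	/* ... tuples on the page would be examined with the current snapshot ... */
}

/*
 * Scan npages pages, assuming (for illustration only) that the oldest running
 * xid advances by one for each page.  Returns the xmin held at the end.
 */
static uint32_t
toy_scan_all(ToyScan *scan, uint32_t npages, uint32_t first_xid)
{
	assert(scan != NULL);

	scan->snapshot_xmin = first_xid;
	scan->resets = 0;

	for (uint32_t page = 0; page < npages; page++)
		toy_scan_page(scan, page, first_xid + page);

	return scan->snapshot_xmin;
}
```

With a single snapshot, the held xmin would stay at first_xid for the whole scan. In this model it trails the oldest running xid by fewer than TOY_RESET_EACH_N_PAGES pages of work, which is the effect the commit message describes.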
  [application/x-patch] v21-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 13-v21-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 4289b27fe73a1654bb619b6088308e5ad1abc9d6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v21 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan during index validation still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints and will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4204088fa0d..a48682b8dbf 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..6d485b84d9f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -633,6 +634,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* The goal of the snapshot reset is to let the xmin horizon advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -674,7 +705,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1325,6 +1361,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cb4bc35c93e..3b4d3c4d581 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, and that infrastructure is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* the push may have copied the snapshot, so refetch it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* do not reset snapshots */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9d70e89c1f3..47340de1d32 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index aa216683b74..3ca35f23ac3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);	/* do not reset snapshots */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One reason for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which causes a new snapshot to be set as active every so often.  The reason
+ * for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e065804cf21..0db23b981db 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index from all tuples visible to one snapshot or to a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4107,9 +4101,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4124,7 +4115,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 549aedcfa99..170c6035fad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6899,6 +6900,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6954,6 +6956,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7011,6 +7018,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3a9424c19c9..418cbf656ee 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -42,6 +42,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8713e12cbfb..8df6ba9b89e 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest one is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to let the xmin horizon keep advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  Snapshots are then changed
+ * on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index e680991f8d4..19d26408c2a 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc
+REGRESS = injection_points hashagg reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index d61149712fd..8476bfe72a7 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -37,6 +37,7 @@ tests += {
       'injection_points',
       'hashagg',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/x-patch] v21-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 14-v21-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From 6e4114e3a37bc4488b0b8b7cf33ac6851a43c90b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v21 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying this technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with the scan
- relax the restriction on parallel workers calling GetLatestSnapshot
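
The "wait until workers restore their initial snapshot" step can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the code in this patch: POSIX threads stand in for parallel workers, and DemoShared, demo_worker_main, demo_wait_for_snapshot_restore and demo_run are hypothetical names. The point is that the leader must not proceed (and later replace its own snapshot) until every worker has reported that it holds the initial one.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_MAX_WORKERS 8

/* Hypothetical shared state; stands in for shared-memory bookkeeping. */
typedef struct DemoShared
{
	pthread_mutex_t lock;
	pthread_cond_t	restored_cv;
	int			nworkers;		/* workers actually launched */
	int			nrestored;		/* workers that restored the snapshot */
} DemoShared;

/* Worker: "restore" the initial snapshot, then report it to the leader. */
static void *
demo_worker_main(void *arg)
{
	DemoShared *shared = (DemoShared *) arg;

	/* A real worker would import the leader's snapshot at this point. */
	pthread_mutex_lock(&shared->lock);
	shared->nrestored++;
	pthread_cond_signal(&shared->restored_cv);
	pthread_mutex_unlock(&shared->lock);
	return NULL;
}

/* Leader: block until every launched worker has reported. */
static int
demo_wait_for_snapshot_restore(DemoShared *shared)
{
	int			seen;

	pthread_mutex_lock(&shared->lock);
	while (shared->nrestored < shared->nworkers)
		pthread_cond_wait(&shared->restored_cv, &shared->lock);
	seen = shared->nrestored;
	pthread_mutex_unlock(&shared->lock);
	return seen;
}

/*
 * Launch nworkers threads and return how many had restored their snapshot
 * when the leader was allowed to proceed; -1 on an invalid request.
 */
int
demo_run(int nworkers)
{
	DemoShared	shared;
	pthread_t	threads[DEMO_MAX_WORKERS];
	int			launched = 0;
	int			seen;
	int			i;

	if (nworkers < 0 || nworkers > DEMO_MAX_WORKERS)
		return -1;

	pthread_mutex_init(&shared.lock, NULL);
	pthread_cond_init(&shared.restored_cv, NULL);
	shared.nworkers = nworkers;
	shared.nrestored = 0;

	for (i = 0; i < nworkers; i++)
	{
		if (pthread_create(&threads[i], NULL, demo_worker_main, &shared) != 0)
		{
			/* Only wait for workers that really started. */
			shared.nworkers = launched;
			break;
		}
		launched++;
	}

	/* Only after this returns is it safe to start resetting snapshots. */
	seen = demo_wait_for_snapshot_restore(&shared);
	assert(seen == shared.nworkers);

	for (i = 0; i < launched; i++)
		pthread_join(threads[i], NULL);

	pthread_cond_destroy(&shared.restored_cv);
	pthread_mutex_destroy(&shared.lock);
	return seen;
}
```

Calling demo_run(3) from a main function returns 3 once all three threads have reported. The counter-plus-condition-variable shape is chosen here for brevity; the patch itself extends WaitForParallelWorkersToAttach with an extra boolean argument rather than using pthreads.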
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index a48682b8dbf..947dc79b138 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * We need to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b4d3c4d581..4cbbf7f2d70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47340de1d32..052ebfe6a21 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build the snapshot is going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report
+		 * whether each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, also wait until each parallel worker has restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3ca35f23ac3..775b995757e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8df6ba9b89e..a69f71a3ace 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. That causes snapshots to be replaced on the fly, allowing the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
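
For readers skimming the patch above: the leader only counts a worker as attached once the worker is attached to its error queue and, when wait_for_snapshot is set, has also raised its snapshot_restored flag. A minimal standalone sketch of that predicate follows. It is not PostgreSQL code; WorkerState and count_known_attached are hypothetical names used only for illustration, and the real function additionally polls, sleeps on a latch and handles worker failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state the leader can observe. */
typedef struct WorkerState
{
	bool		attached;			/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker has installed its snapshot */
} WorkerState;

/*
 * Count the workers the leader may treat as "known attached".  With
 * wait_for_snapshot set, attachment alone is not enough: the worker must
 * also have reported that its snapshot is restored.
 */
static int
count_known_attached(const WorkerState *workers, size_t nworkers,
					 bool wait_for_snapshot)
{
	int			nknown = 0;

	assert(workers != NULL || nworkers == 0);
	for (size_t i = 0; i < nworkers; i++)
	{
		if (!workers[i].attached)
			continue;
		if (!wait_for_snapshot || workers[i].snapshot_restored)
			nknown++;
	}
	return nknown;
}
```

In the patch the leader keeps re-evaluating this condition until every launched worker satisfies it, which is why a worker that has attached but not yet restored its snapshot delays the start of the scan.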

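
The three-way snapshot choice made in _bt_begin_parallel() in the patch above can be summarized as a small decision function. This is only a sketch under stated assumptions: the enum and the function name are hypothetical labels introduced here, and the real code pushes, registers or skips snapshots rather than returning a value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three cases handled in _bt_begin_parallel(). */
typedef enum BuildSnapshotMode
{
	BUILD_SNAPSHOT_ANY,			/* normal build: SnapshotAny */
	BUILD_SNAPSHOT_RESET,		/* concurrent, non-unique: reset periodically */
	BUILD_SNAPSHOT_REGISTERED	/* concurrent, unique: one registered MVCC snapshot */
} BuildSnapshotMode;

static BuildSnapshotMode
choose_build_snapshot_mode(bool isconcurrent, bool isunique)
{
	/* Same condition as reset_snapshot in the patch. */
	bool		reset_snapshot = isconcurrent && !isunique;

	/* Resetting only ever applies to concurrent builds. */
	assert(!reset_snapshot || isconcurrent);

	if (!isconcurrent)
		return BUILD_SNAPSHOT_ANY;
	if (reset_snapshot)
		return BUILD_SNAPSHOT_RESET;
	return BUILD_SNAPSHOT_REGISTERED;
}
```

Unique indexes stay on the registered-snapshot path because, as the patch comment says, they need a stable snapshot to enforce uniqueness.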


  [application/octet-stream] v21-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 15-v21-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)
  download

  [application/octet-stream] v21-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (30.5K, 16-v21-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)
  download

  [application/octet-stream] v21-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 17-v21-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v21-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (37.4K, 18-v21-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)
  download

  [application/octet-stream] v21-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (95.6K, 19-v21-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)
  download

  [application/octet-stream] v21-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (19.0K, 20-v21-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)
  download

^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-07-03 00:23                           ` Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-07-03 00:23 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello!

Rebased again; the patch structure and comments are available here [0]. A quick
introductory poster is here [1].

Best regards,
Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Attachments:

  [application/octet-stream] v22-0010-Optimize-auxiliary-index-handling.patch (2.4K, 2-v22-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From a83a85d0217bed727681676f8029440181699967 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v22 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index-insert path, detecting auxiliary indexes and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 97e5d2d68aa..b37684309e3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2933,6 +2933,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0edf54e852d..09b9b811def 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v22-0011-Refresh-snapshot-periodically-during-index-valid.patch (23.3K, 3-v22-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From a57c5b4548d9d9f2e4ac4d1eee3f3f0b648255d2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v22 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             |  7 +--
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 11 files changed, 146 insertions(+), 83 deletions(-)
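
The "fresh snapshot every few pages" behaviour described in the commit message can be modelled with a simple counter. The sketch below is an assumption-laden illustration, not the patch's code: PAGES_PER_SNAPSHOT is a hypothetical interval (the real value is not shown in this part of the thread), count_snapshot_refreshes is a hypothetical name, and the modulo test is only one plausible way to use a counter such as page_read_counter.

```c
#include <assert.h>

#define PAGES_PER_SNAPSHOT 4	/* hypothetical interval, not taken from the patch */

/*
 * Simulate a validation scan over npages heap pages and return how many
 * times the snapshot would be replaced.  The counter starts at 1 so that
 * the snapshot taken just before the scan is not replaced immediately.
 */
static int
count_snapshot_refreshes(int npages)
{
	int			page_read_counter = 1;
	int			refreshes = 0;

	assert(npages >= 0);
	for (int page = 0; page < npages; page++)
	{
		if (page_read_counter % PAGES_PER_SNAPSHOT == 0)
		{
			/* a real scan would pop the old snapshot and push a fresh one */
			refreshes++;
		}
		page_read_counter++;
	}
	return refreshes;
}
```

Between two refreshes the backend holds a snapshot that is at most PAGES_PER_SNAPSHOT pages old, which is what lets its advertised xmin move forward during a long validation.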

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index cf14f474946..1626cee7a03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f592b09ec68..236e216d170 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so
+	 * often, so that our advertised xmin horizon can advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 08a3cb28348..250d9d59b9a 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -444,7 +444,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 2678f7ab782..968a8f7725c 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b37684309e3..e20d8a60357 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3535,8 +3535,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3549,7 +3550,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3570,13 +3571,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3626,8 +3628,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3663,6 +3669,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3697,6 +3706,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3716,19 +3728,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3751,6 +3768,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e5259d1d82e..e6260f7011e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -592,7 +592,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1794,32 +1793,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1841,8 +1819,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4382,7 +4360,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4397,13 +4374,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4415,16 +4385,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4437,7 +4399,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6e280aa4e6a..c0aac2dab77 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0
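
For readers following the limitXmin handling in the patch above, here is a
small standalone model of two pieces of it: the wraparound-aware "newer of
two xids" helper, and the cadence at which the validation scan swaps its
snapshot. This is only a sketch under stated assumptions, not PostgreSQL
code. The names Xid, INVALID_XID, FIRST_NORMAL_XID, xid_follows, xid_newer
and count_snapshot_resets are hypothetical stand-ins invented for the
example; only the 4096-page constant and the "counter starts at 1,
incremented once per page" behaviour are taken from the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and its special values. */
typedef uint32_t Xid;

#define INVALID_XID        ((Xid) 0)
#define FIRST_NORMAL_XID   ((Xid) 3)
#define RESET_EACH_N_PAGE  4096		/* same value as in the patch */

static bool
xid_is_valid(Xid x)
{
	return x != INVALID_XID;
}

/*
 * Wraparound-aware "a is newer than b": normal xids are compared using
 * modulo-2^32 arithmetic, special (non-normal) xids compare as plain
 * unsigned integers.
 */
static bool
xid_follows(Xid a, Xid b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two xids; an invalid xid loses to a valid one. */
static Xid
xid_newer(Xid a, Xid b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/*
 * Count how many times the snapshot would be replaced while scanning
 * npages pages, given a counter that starts at 1 and is incremented once
 * per page before the modulo check.
 */
static int
count_snapshot_resets(int npages)
{
	int			page_read_counter = 1;
	int			resets = 0;

	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;
		if (page_read_counter % RESET_EACH_N_PAGE == 0)
			resets++;
	}
	return resets;
}
```

In this model the first replacement happens once 4095 pages have been read,
because the counter starts at 1, and then every 4096 pages after that. The
running maximum kept via xid_newer is what the caller would later hand to
the wait for older snapshots.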



  [application/octet-stream] v22-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.2K, 4-v22-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 5d93d19e7f4b046719d42bdcb24d05473ad5d3a0 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v22 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to ignore other processes running concurrent builds of indexes that have neither expressions nor predicates. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 947dc79b138..a59f84a4251 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 629f6d5f2c0..df79b5850f9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 250d9d59b9a..f80379618b2 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1910,11 +1910,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e6260f7011e..1b8b9146eb1 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -115,7 +115,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -418,10 +417,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -442,8 +438,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -463,8 +458,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -578,7 +572,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1182,10 +1175,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1671,10 +1660,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1729,9 +1714,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1761,10 +1743,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1790,9 +1768,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1809,9 +1785,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1852,10 +1825,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1876,10 +1845,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3654,7 +3619,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4028,17 +3992,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4104,7 +4057,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4205,11 +4157,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4241,10 +4188,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4253,11 +4196,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4282,10 +4220,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4305,11 +4239,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4331,10 +4260,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4370,10 +4295,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4401,9 +4322,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4425,13 +4343,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4487,12 +4398,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4556,12 +4461,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4829,36 +4728,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 9f9b3fcfbf1..5e07466c737 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f4a62ed1ca7..b217b1aa951 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
+REGRESS = injection_points hashagg vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index ba7bc0cc384..7feaf05129c 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'vacuum',
       'cic_reset_snapshots',
     ],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v22-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.8K, 5-v22-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 9f3ee2bb04848a66e6803263240ac1cd9da50fc2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v22 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index once the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both indexes, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index
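
The two-pointer step above can be sketched in isolation as follows. This is only a minimal illustration, not the patch code: it assumes the TIDs have already been pulled out of the tuplesorts into sorted int64 arrays, and tid_encode and tid_difference are hypothetical names (tid_encode is loosely modeled on itemptr_encode).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack (block, offset) into one integer so that
 * integer order matches TID order.
 */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((int64_t) block << 16) | offset;
}

/*
 * Hypothetical helper: write to "out" every TID that is present in
 * aux_tids but absent from main_tids.  Both inputs must be sorted in
 * ascending order; "out" must have room for naux entries.
 *
 * Returns the number of TIDs written.
 */
static size_t
tid_difference(const int64_t *main_tids, size_t nmain,
			   const int64_t *aux_tids, size_t naux,
			   int64_t *out)
{
	size_t		i = 0;			/* cursor into main_tids */
	size_t		n = 0;			/* number of TIDs written */

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes the aux TID. */
		while (i < nmain && main_tids[i] < aux_tids[j])
			i++;

		/* Main exhausted or already past this TID: it is missing there. */
		if (i == nmain || main_tids[i] > aux_tids[j])
			out[n++] = aux_tids[j];
	}

	return n;
}
```

For main = {(1,1), (1,3), (2,5)} and aux = {(1,2), (1,3), (3,1)}, tid_difference returns 2 and out holds (1,2) and (3,1). Each cursor only moves forward, so the cost is linear in the combined size of the two inputs.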

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 334 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1120 insertions(+), 343 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4265a22d4de..8ccd69b14c2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6314,6 +6314,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6354,13 +6366,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6377,8 +6388,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58ffa4306e2..f592b09ec68 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of maintenance_work_mem for the tuplestore. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5bdb577624c..6c8151e538a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2632,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2693,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that involves
+ * no actual table scan, and it leaves us with an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3495,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3517,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of the
+ * two TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3542,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3594,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3619,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the index relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If the aux index is empty, the merge can be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3690,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts were ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3722,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3783,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e5dbbe61b81..5f6727785c5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1283,16 +1283,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7e04e5be2a9..e64be43fc3f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -553,6 +556,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -562,6 +566,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -583,6 +588,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -833,6 +839,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -928,7 +943,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1593,6 +1609,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1621,11 +1647,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1635,7 +1661,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1674,7 +1700,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1686,14 +1712,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index.  It will be created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  The goal of
+	 * that index is to accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1722,9 +1772,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1742,24 +1811,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1786,7 +1845,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1811,6 +1870,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3565,6 +3671,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3670,8 +3777,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3723,8 +3837,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3785,6 +3906,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3888,15 +4016,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3947,6 +4078,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3960,12 +4096,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3974,6 +4115,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3992,10 +4134,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4076,13 +4222,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index.  This is fast: there is no heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform the target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4125,6 +4314,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4132,12 +4356,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4175,7 +4393,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4204,7 +4422,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4294,14 +4512,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4326,6 +4544,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4339,11 +4579,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4363,6 +4603,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 570b1ed9f2f..6e280aa4e6a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void		(*index_validate_scan) (Relation table_rel,
+										Relation index_rel,
+										struct IndexInfo *index_info,
+										Snapshot snapshot,
+										struct ValidateIndexState *state,
+										struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sort states
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 7c736e7b03b..6e14577ef9b 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 9ade7b835e6..ca74844b5c6 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6cf828ca8d0..ed6c20a495c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,14 +2041,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index e21ff426519..2cff1ac29be 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v22-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 6-v22-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 898a398c7ade22f24a6fc294f9e83f90b776c2a5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v22 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, an error can leave auxiliary indexes behind as orphaned objects (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add an auxiliary_for_index_id parameter to makeIndexInfo(), stored in IndexInfo as ii_AuxiliaryForIndexId, so that index_create() can record the dependency between the main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
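Reader's note (not part of the commit): the lookup that get_auxiliary_index() performs over pg_depend can be pictured with the toy model below. It is only a sketch under stated assumptions: ToyOid, ToyDepend, the TOY_* constants and toy_get_auxiliary_index() are hypothetical stand-ins invented for illustration, not PostgreSQL definitions, and a flat array replaces the catalog scan. Like the patch, it returns the first AUTO dependency on the given index whose dependent object is itself an index.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; NOT the PostgreSQL definitions. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0u

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_KIND_TABLE, TOY_KIND_INDEX } ToyRelKind;

/* One simplified dependency record: objid depends on refobjid. */
typedef struct ToyDepend
{
	ToyOid		objid;		/* dependent object */
	ToyRelKind	objkind;	/* kind of the dependent object */
	ToyOid		refobjid;	/* referenced object */
	ToyDepType	deptype;	/* kind of dependency */
} ToyDepend;

/*
 * Return the first index that has an AUTO dependency on indexId,
 * or TOY_INVALID_OID if there is none.
 */
ToyOid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, ToyOid indexId)
{
	assert(deps != NULL || ndeps == 0);

	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objkind == TOY_KIND_INDEX)
			return deps[i].objid;
	}
	return TOY_INVALID_OID;
}
```

In this model, a main index with no matching record simply yields TOY_INVALID_OID, which corresponds to the OidIsValid() checks guarding every use of junkAuxIndexId in the diff.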
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>, the
+    recommended recovery method is simply to run <literal>DROP INDEX</literal>
+    on that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is simply to run <literal>DROP INDEX</literal> on that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6c8151e538a..97e5d2d68aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2633,7 +2646,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2694,7 +2708,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3869,6 +3884,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3941,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index belonging to this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4242,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4332,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4381,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from an object
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e64be43fc3f..e5259d1d82e 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -246,7 +246,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -944,7 +944,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3672,6 +3673,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4021,6 +4023,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4028,6 +4031,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4101,12 +4105,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look up any junk auxiliary index of the index being rebuilt, so it can be dropped */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4116,6 +4125,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4137,10 +4147,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4321,7 +4339,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4344,6 +4363,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is also marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4562,6 +4584,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4613,6 +4637,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f9f594b44cf..7e4ad950a1c 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first operation in its
+		 * transaction. But if there is a junk auxiliary index, we want to drop
+		 * it too (also concurrently). In that case, silently perform an internal
+		 * deletion of the auxiliary index, then restore the transaction state.
+		 * It is fine to do this in the loop because drop->objects has only one element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9c409532c44..ab2d25a10d9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -220,6 +220,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index ca74844b5c6..aca6ec57ad7 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 2cff1ac29be..e1464eaa67c 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v22-0007-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 7-v22-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 9632ea36aa489f272865c4b3824694cabac64eb6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v22 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)
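
The on-tape layout described in the commit message can be illustrated with a
small standalone sketch. It is not the patch's code: `Tape`, `tape_put` and
`tape_get` are hypothetical names, and an in-memory buffer stands in for the
BufFile. It assumes only what the message states: fixed-length values are
written as raw bytes, while variable-length values get an "unsigned int"
length word in front of the payload. Padding and the trailing length word
used for backward scans are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; the real code writes to a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

/*
 * Append one value.  typlen > 0 means a fixed-length type (raw bytes only);
 * typlen < 0 means variable length (length word first, then the payload).
 */
static void
tape_put(Tape *t, int typlen, const void *data, unsigned int datalen)
{
	if (typlen > 0)
		assert(datalen == (unsigned int) typlen);
	else
	{
		assert(t->wpos + sizeof(datalen) <= sizeof(t->buf));
		memcpy(t->buf + t->wpos, &datalen, sizeof(datalen));
		t->wpos += sizeof(datalen);
	}
	assert(t->wpos + datalen <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, data, datalen);
	t->wpos += datalen;
}

/*
 * Read the next value into 'out'.  Returns the payload length, or 0 once
 * everything written so far has been consumed.
 */
static unsigned int
tape_get(Tape *t, int typlen, void *out, size_t outsize)
{
	unsigned int len;

	if (t->rpos >= t->wpos)
		return 0;				/* end of data */
	if (typlen > 0)
		len = (unsigned int) typlen;	/* length is implied by the type */
	else
	{
		memcpy(&len, t->buf + t->rpos, sizeof(len));
		t->rpos += sizeof(len);
	}
	assert(len <= outsize);
	memcpy(out, t->buf + t->rpos, len);
	t->rpos += len;
	return len;
}
```

With a 6-byte fixed-length type such as tid, each stored value costs exactly
6 bytes in this sketch; the length word is paid only for variable-length
types. Note that the reader checks for end of data before assuming a
fixed-length record is present.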

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum with constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * The same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot but for single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);	/* 'datum' holds the value itself */
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;	/* same width as read by lentup_datum */
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v22-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.4K, 8-v22-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From a3b9cf935351592cb60e051b33cd8abbff1d00d4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v22 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring tuples that are dead according to SnapshotSelf during uniqueness checks in tuplesort (a fail-fast mechanism, not a guarantee)
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values (the guarantee of correctness)

Tuples are tested against SnapshotSelf only when index key values are equal; otherwise _bt_load works as before.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  69 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 264 insertions(+), 94 deletions(-)
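
The rule stated in the commit message (liveness is tested only when
neighbouring keys are equal, and more than one alive tuple per key is an
error) can be sketched in isolation as below. This is an illustration under
stated assumptions, not the _bt_load implementation: `Entry`,
`find_unique_violation` and the `alive` flag are hypothetical, with `alive`
standing in for a SnapshotSelf visibility test on the heap tuple, and the
input is assumed to be already sorted by key.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sorted-stream entry. */
typedef struct Entry
{
	int			key;
	bool		alive;			/* stand-in for a SnapshotSelf test */
} Entry;

/*
 * Scan entries sorted by key.  Returns the index of the first entry that
 * makes a key have two alive tuples, or -1 if the input is unique.
 * *nchecks counts how many liveness tests were needed; it stays at zero
 * when no two neighbouring keys are equal.
 */
static long
find_unique_violation(const Entry *e, size_t n, size_t *nchecks)
{
	bool		group_open = false; /* liveness evaluated for this key yet? */
	bool		seen_alive = false; /* an alive tuple seen for this key? */

	*nchecks = 0;
	for (size_t i = 1; i < n; i++)
	{
		if (e[i].key != e[i - 1].key)
		{
			/* new key: no liveness test needed at all */
			group_open = false;
			seen_alive = false;
			continue;
		}

		/* equal keys: lazily test the first tuple of the group as well */
		if (!group_open)
		{
			seen_alive = e[i - 1].alive;
			(*nchecks)++;
			group_open = true;
		}

		(*nchecks)++;
		if (e[i].alive)
		{
			if (seen_alive)
				return (long) i;	/* second alive tuple with the same key */
			seen_alive = true;
		}
	}
	return -1;
}
```

A key with one dead and one alive version passes, which is why a tuple
deleted after an earlier snapshot reset does not make the build fail, while
two alive versions of the same key are reported.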

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4cbbf7f2d70..58ffa4306e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 08884116aec..347b50d6e51 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 052ebfe6a21..08a3cb28348 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -321,20 +319,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -381,6 +379,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -429,8 +432,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -438,8 +442,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -470,7 +478,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -483,7 +491,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -539,7 +547,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -561,7 +569,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -575,7 +583,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1154,13 +1162,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool holds at most one
+	 * alive tuple per key value. Multiple alive tuples may remain if they are
+	 * separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not tested yet */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1320,7 +1432,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1417,7 +1529,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,21 +1546,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1457,16 +1559,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1536,6 +1638,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1550,7 +1653,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1630,7 +1733,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1641,7 +1744,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1744,6 +1847,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1847,11 +1951,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1931,6 +2036,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1953,14 +2059,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index e6c9aaa0454..7cb1f3e1bc6 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 9aed207995f..e6bfca1bf63 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -64,8 +64,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2419,7 +2417,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3754,7 +3752,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3872,17 +3870,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls (if provided) is set to true when any compared column is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -3908,6 +3913,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3927,7 +3934,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3938,7 +3945,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3947,6 +3955,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls (if provided) is set to true when any compared column is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -3955,7 +3965,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -3972,6 +3983,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 775b995757e..9a423425aec 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b3840448a8b..7e04e5be2a9 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,8 +1694,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 5f70e8dddac..71a5c21e0df 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -133,6 +134,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -358,6 +360,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -400,6 +403,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1653,6 +1657,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1662,18 +1667,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index e709d2e0afe..4bd8a403cbb 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1340,8 +1340,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc3b551e8e9..570b1ed9f2f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots change on the fly, which allows the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
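
For readers following the uniqueDeadIgnored change in the patch above, the decision it adds to the sort comparator can be modelled in isolation. This is only a simplified sketch, not the patch's code: DemoTuple and demo_duplicate_is_violation are hypothetical names, and the "alive" flag stands in for the SnapshotSelf heap fetch that the real code performs for each TID.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an index tuple; "alive" replaces the heap fetch. */
typedef struct DemoTuple
{
	int			key;
	bool		alive;
} DemoTuple;

/*
 * Decide whether two equal-keyed tuples should raise a unique violation.
 * With dead_ignored set, the pair counts as a violation only if both
 * tuples are still alive; without it, any equal pair is a violation.
 */
static bool
demo_duplicate_is_violation(const DemoTuple *a, const DemoTuple *b,
							bool dead_ignored)
{
	assert(a->key == b->key);	/* the caller already found the keys equal */

	if (dead_ignored && (!a->alive || !b->alive))
		return false;			/* at least one tuple is dead: skip the error */

	return true;
}
```

In this model a live/dead pair is reported only when dead_ignored is false, which mirrors how the comparator leaves the final verdict to the later check in _bt_load.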



  [application/octet-stream] v22-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 9-v22-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From d0bc9af891f9f827ebe05078a9124cfaab8c9add Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v22 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete called during the validation phase of concurrent index build
- STIR(Short-Term Index Replacement) access method is introduced, intended solely for short-lived, auxiliary usage
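
As a rough illustration of how the validate_index flag is meant to steer a bulk-delete callback, here is a minimal sketch. It assumes a much-simplified model: DemoVacuumInfo, DemoTid, DemoCollector, demo_collect and demo_bulkdelete are hypothetical names, and the return value of -1 merely stands in for the "warn and refuse" path that an ordinary VACUUM takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced versions of the real structures. */
typedef struct DemoVacuumInfo
{
	bool		validate_index; /* called from concurrent-build validation? */
} DemoVacuumInfo;

typedef struct DemoTid
{
	unsigned int block;
	unsigned short offset;
} DemoTid;

typedef bool (*DemoCallback) (const DemoTid *tid, void *state);

typedef struct DemoCollector
{
	DemoTid		seen[8];
	size_t		count;
} DemoCollector;

/* Callback that records each TID and never asks for it to be deleted. */
static bool
demo_collect(const DemoTid *tid, void *state)
{
	DemoCollector *collector = (DemoCollector *) state;

	if (collector->count < 8)
		collector->seen[collector->count++] = *tid;
	return false;
}

/*
 * Outside validation nothing is scanned and -1 is returned; during
 * validation every stored TID is handed to the callback.
 */
static long
demo_bulkdelete(const DemoVacuumInfo *info, const DemoTid *tids,
				size_t ntids, DemoCallback callback, void *state)
{
	size_t		i;

	if (!info->validate_index)
		return -1;

	for (i = 0; i < ntids; i++)
		(void) callback(&tids[i], state);

	return (long) ntids;
}
```

The point of the flag is that the same entry point serves two callers with very different expectations, so the access method has to be told which one it is dealing with.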

STIR is designed as an ephemeral helper for concurrent index builds: it temporarily stores TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
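
The append-only storage described in the commit message can be pictured with a toy page. The sketch below is an assumption-laden simplification rather than the on-disk format: DemoPage, DemoTid, DEMO_ITEMS_PER_PAGE, demo_page_init and demo_page_add_item are hypothetical names, and the tiny fixed capacity replaces the real free-space computation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_ITEMS_PER_PAGE 4	/* tiny capacity so the "full" case is easy to hit */

typedef struct DemoTid
{
	unsigned int block;
	unsigned short offset;
} DemoTid;

typedef struct DemoPage
{
	unsigned short maxoff;		/* number of TIDs stored so far */
	DemoTid		items[DEMO_ITEMS_PER_PAGE];
} DemoPage;

/* Reset a page to the empty state. */
static void
demo_page_init(DemoPage *page)
{
	memset(page, 0, sizeof(*page));
}

/*
 * Append a TID at the end of the page.  Returns false when the page is
 * full, in which case the caller is expected to move on to a new page.
 */
static bool
demo_page_add_item(DemoPage *page, const DemoTid *tid)
{
	if (page->maxoff >= DEMO_ITEMS_PER_PAGE)
		return false;

	page->items[page->maxoff] = *tid;
	page->maxoff++;
	return true;
}
```

Because items are only ever appended and never removed, a single counter is enough to describe the page contents.
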
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index a6dad54ff58..ca5214461e6 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 14036c27e87..83c14a70dc8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3082,6 +3082,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3133,6 +3134,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
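+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			/* Restore and free the temporary context before leaving */
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}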
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9a423425aec..5bdb577624c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7111d5d5334..e8f2fd99534 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -720,6 +720,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index dfbb4c85460..a121b4d31c9 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index d4650947c63..f6699b5efd6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e107d6e5f81..9c409532c44 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -157,8 +157,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -218,7 +218,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 236eba2540e..dfacf3a7ac2 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5137,7 +5137,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5151,7 +5152,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5176,9 +5178,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5187,12 +5189,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5201,7 +5204,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
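
A note on the StirPageGetFreeSpace macro from the stir.h hunk above: the arithmetic can be checked with a small standalone sketch. All DEMO_* constants and demo_* functions below are hypothetical stand-ins, assuming a default build (8 kB blocks, 8-byte maximum alignment, a 24-byte page header, a 6-byte ItemPointerData and the 8-byte StirPageOpaqueData of four uint16 fields). It is not code from the patch.

```c
#include <stddef.h>

/* Assumed stand-ins for PostgreSQL definitions in a default build. */
#define DEMO_BLCKSZ				8192	/* BLCKSZ */
#define DEMO_MAXIMUM_ALIGNOF	8		/* MAXIMUM_ALIGNOF */
#define DEMO_PAGE_HEADER_SIZE	24		/* SizeOfPageHeaderData */
#define DEMO_STIR_TUPLE_SIZE	6		/* sizeof(StirTuple): one ItemPointerData */
#define DEMO_STIR_OPAQUE_SIZE	8		/* sizeof(StirPageOpaqueData): 4 x uint16 */

/* Round len up to the next multiple of the maximum alignment. */
#define DEMO_MAXALIGN(len) \
	(((size_t) (len) + (DEMO_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (DEMO_MAXIMUM_ALIGNOF - 1)))

/*
 * Mirrors the shape of StirPageGetFreeSpace: block size minus the aligned
 * page header, the tuples already stored, and the aligned opaque area.
 * Returns 0 instead of wrapping around if maxoff exceeds the page capacity.
 */
size_t
demo_stir_free_space(size_t maxoff)
{
	size_t		usable = DEMO_BLCKSZ
		- DEMO_MAXALIGN(DEMO_PAGE_HEADER_SIZE)
		- DEMO_MAXALIGN(DEMO_STIR_OPAQUE_SIZE);
	size_t		used = maxoff * DEMO_STIR_TUPLE_SIZE;

	return (used > usable) ? 0 : usable - used;
}

/* Number of fixed-size tuples that fit on an empty page. */
size_t
demo_stir_tuples_per_page(void)
{
	return demo_stir_free_space(0) / DEMO_STIR_TUPLE_SIZE;
}
```

Under these assumptions an empty page has 8160 usable bytes, which is room for 1360 heap pointers. Builds with a different block size or alignment would give different numbers.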



  [application/octet-stream] v22-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 10-v22-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From b85f826ff6a288b479d8adde9d2b087f03ac9172 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v22 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying this technique to parallel builds was the requirement to wait until worker processes have restored their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with the scan
- relax the restriction on parallel workers calling GetLatestSnapshot
---
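
The ordering constraint described in the commit message can be modelled in isolation. This is a minimal single-threaded sketch, assuming a simple counter of workers that have restored the initial snapshot. The DemoBuildState type and the demo_* functions are hypothetical and are not the patch's infrastructure, which relies on WaitForParallelWorkersToAttach and shared memory.

```c
#include <stdbool.h>

/* Hypothetical model of the leader/worker handshake; not patch code. */
typedef struct DemoBuildState
{
	int			nworkers;		/* workers launched by the leader */
	int			nrestored;		/* workers that restored the initial snapshot */
	int			nresets;		/* snapshot resets performed by the leader */
} DemoBuildState;

void
demo_build_init(DemoBuildState *st, int nworkers)
{
	st->nworkers = nworkers;
	st->nrestored = 0;
	st->nresets = 0;
}

/* Called once by each worker after it has restored the leader's snapshot. */
void
demo_worker_snapshot_restored(DemoBuildState *st)
{
	if (st->nrestored < st->nworkers)
		st->nrestored++;
}

/*
 * The leader may replace its snapshot only after every worker has restored
 * the initial one; otherwise a worker could try to import a snapshot the
 * leader no longer holds.  Returns true if a reset was performed.
 */
bool
demo_leader_try_reset_snapshot(DemoBuildState *st)
{
	if (st->nrestored < st->nworkers)
		return false;			/* still waiting for workers */
	st->nresets++;
	return true;
}
```

In this model the leader's first successful reset happens only after the last worker has reported in, which is the property the explicit wait in the concurrent path is meant to guarantee.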
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index a48682b8dbf..947dc79b138 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4cea1612ce6..629f6d5f2c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3b4d3c4d581..4cbbf7f2d70 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47340de1d32..052ebfe6a21 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -321,22 +321,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -485,8 +483,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1420,6 +1417,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1437,12 +1435,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, although that snapshot may be reset periodically
+	 * for a non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1450,6 +1457,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1510,7 +1522,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1537,7 +1549,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1613,6 +1626,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1621,7 +1641,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1645,7 +1666,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1895,6 +1916,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1949,11 +1971,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1989,4 +2015,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, set SO_RESET_SNAPSHOT
+	 * on the scan and use the current active snapshot as the initial
+	 * snapshot.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = snapshot_set_flag_space + i;
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Reset the per-worker snapshot-restored flags. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker counts as attached only after it has
+ * restored its snapshot. This is needed when using periodic snapshot resets, to
+ * ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 3ca35f23ac3..775b995757e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index ed35c58c2c3..8a15dd72b91 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -367,7 +367,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ad440ff024c..f251bc52895 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -342,14 +342,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 41a2d095d2c..fc3b551e8e9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. This changes snapshots on the fly to allow the xmin horizon to
+ * advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v22-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 11-v22-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From bf437057300421d9dc6212e0a36865449082d7a3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v22 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces an alternative: periodically resetting the snapshot used during the first phase. Resetting the snapshot every N pages during the heap scan allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option, SO_RESET_SNAPSHOT, is introduced. When set, it causes the snapshot to be reset between pages, once every SO_RESET_SNAPSHOT_EACH_N_PAGE pages of the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and do not use parallel workers.
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 3048e044aec..e59197bb35e 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 0d9c2b0b653..a6dad54ff58 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4204088fa0d..a48682b8dbf 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a65acd89104..4cea1612ce6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9e707167d98..56981147ae1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 0dcd6ee817e..6d485b84d9f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -633,6 +634,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* The horizon still cannot advance if an xid has been assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -674,7 +705,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1325,6 +1361,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cb4bc35c93e..3b4d3c4d581 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs one consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to work with
+	 * SO_RESET_SNAPSHOT, which is not ready yet.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because the scan replaces it every so often and a
+			 * registered snapshot would keep holding back the xmin horizon.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* the snapshot may have been copied, so point at the active one */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* no snapshot reset */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
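The rule that decides whether a build scan may reset its snapshot can be read in isolation. Below is a minimal sketch of that decision, assuming only what the hunk above shows; `ToyIndexInfo` and `toy_reset_snapshots` are hypothetical names invented for illustration and are not part of the patch. Note that in the patch the flag is passed to `table_beginscan_strat` only on the serial path, so a parallel scan is not affected by it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the IndexInfo fields the patch consults. */
typedef struct ToyIndexInfo
{
	bool		concurrent;		/* models indexInfo->ii_Concurrent */
	bool		unique;			/* models indexInfo->ii_Unique */
} ToyIndexInfo;

/*
 * Mirrors the reset_snapshots expression in heapam_index_build_range_scan:
 * only concurrent builds of non-unique indexes on non-catalog tables qualify.
 */
bool
toy_reset_snapshots(const ToyIndexInfo *info, bool is_system_catalog)
{
	assert(info != NULL);
	return info->concurrent && !info->unique && !is_system_catalog;
}
```

Under this reading, a unique index always keeps a single snapshot for the whole scan, which matches the absence of injection-point notices for `CREATE UNIQUE INDEX CONCURRENTLY` in the expected test output.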
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 9d70e89c1f3..47340de1d32 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -258,7 +258,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -321,18 +321,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -480,6 +484,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -535,7 +542,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -557,18 +564,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1409,6 +1419,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1434,9 +1445,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1490,6 +1508,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1584,6 +1604,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1600,6 +1622,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
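Both `_bt_begin_parallel` and `_gin_begin_parallel` now have to pop the active snapshot on three separate exits: no DSM segment, no workers launched, and the normal return. The following is a toy model of that control flow, assuming the active snapshot stack can be reduced to a depth counter; every `toy_` name is hypothetical and nothing here calls the real snapshot manager.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the active snapshot stack: only its depth is tracked. */
int			toy_active_depth = 0;

void
toy_push_active_snapshot(void)
{
	toy_active_depth++;
}

void
toy_pop_active_snapshot(void)
{
	assert(toy_active_depth > 0);
	toy_active_depth--;
}

typedef enum ToyOutcome
{
	TOY_NO_DSM,					/* models pcxt->seg == NULL */
	TOY_NO_WORKERS,				/* models nworkers_launched == 0 */
	TOY_LAUNCHED				/* workers attached normally */
} ToyOutcome;

/*
 * A concurrent build pushes its own snapshot and must pop it on every exit.
 * A non-concurrent build relies on the caller's snapshot and pops nothing.
 * Returns the change in stack depth seen by the caller (expected: 0).
 */
int
toy_begin_parallel(bool isconcurrent, ToyOutcome outcome)
{
	int			depth_on_entry = toy_active_depth;
	bool		need_pop_active_snapshot = true;

	if (!isconcurrent)
	{
		assert(toy_active_depth > 0);	/* caller's snapshot is active */
		need_pop_active_snapshot = false;
	}
	else
		toy_push_active_snapshot();

	if (outcome == TOY_NO_DSM || outcome == TOY_NO_WORKERS)
	{
		/* early exit: back out to a serial build */
		if (need_pop_active_snapshot)
			toy_pop_active_snapshot();
		return toy_active_depth - depth_on_entry;
	}

	/* normal exit, after the workers have attached */
	if (need_pop_active_snapshot)
		toy_pop_active_snapshot();
	return toy_active_depth - depth_on_entry;
}
```

The point of the sketch is only that the depth is unchanged on every path; missing one of the three pops would leave a snapshot active and hold back the horizon the patch is trying to release.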
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index aa216683b74..3ca35f23ac3 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate the catalog snapshot so that the assertion below can hold */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false); /* no snapshot reset */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which causes a new snapshot to be set as active every so often.  The reason
+ * for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
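Why popping and re-pushing a snapshot lets the horizon move can be illustrated with a toy model. The sketch below assumes a backend advertises the oldest xmin among the snapshots it currently holds and advertises nothing when it holds none; `ToyBackend` and the `toy_` functions are hypothetical, xids are plain integers, and wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INVALID_XID		0u
#define TOY_MAX_SNAPSHOTS	8

typedef struct ToyBackend
{
	uint32_t	snap_xmin[TOY_MAX_SNAPSHOTS];	/* xmin of each held snapshot */
	int			nsnaps;
} ToyBackend;

/* Advertised xmin: oldest xmin among held snapshots, or invalid if none. */
uint32_t
toy_advertised_xmin(const ToyBackend *b)
{
	uint32_t	result = TOY_INVALID_XID;

	for (int i = 0; i < b->nsnaps; i++)
		if (result == TOY_INVALID_XID || b->snap_xmin[i] < result)
			result = b->snap_xmin[i];
	return result;
}

void
toy_push(ToyBackend *b, uint32_t xmin)
{
	assert(b->nsnaps < TOY_MAX_SNAPSHOTS);
	b->snap_xmin[b->nsnaps++] = xmin;
}

void
toy_pop(ToyBackend *b)
{
	assert(b->nsnaps > 0);
	b->nsnaps--;
}

/* Models heap_reset_scan_snapshot: drop the only snapshot, take a fresh one. */
void
toy_reset_scan_snapshot(ToyBackend *b, uint32_t latest_xmin)
{
	toy_pop(b);
	/* nothing is held at this point, so the horizon is released */
	assert(toy_advertised_xmin(b) == TOY_INVALID_XID);
	toy_push(b, latest_xmin);
}
```

In this model, a second snapshot held elsewhere (for example a registered copy) would keep the minimum at the old value, which is presumably why the patch asserts `regd_count == 0` and `!HaveRegisteredOrActiveSnapshot()` around the reset.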
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index bd291d05f68..b3840448a8b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1694,23 +1694,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to a
+	 * series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4107,9 +4101,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4124,7 +4115,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 549aedcfa99..170c6035fad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -62,6 +62,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6899,6 +6900,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6954,6 +6956,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7011,6 +7018,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a2bd5a897f8..0ef0957a627 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
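The reset cadence follows from the `rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0` test in `heap_fetch_next_buffer`. A small sketch of the arithmetic, assuming every block of the relation is read exactly once; `toy_count_snapshot_resets` is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as SO_RESET_SNAPSHOT_EACH_N_PAGE in the patch. */
#define TOY_RESET_EACH_N_PAGE	4096

/*
 * Count the blocks in [0, nblocks) whose number is a multiple of the
 * interval; each such block triggers one snapshot reset in this model.
 */
uint32_t
toy_count_snapshot_resets(uint32_t nblocks)
{
	uint32_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		if (blkno % TOY_RESET_EACH_N_PAGE == 0)
			resets++;
	return resets;
}
```

Because block 0 satisfies the modulo test, even a one-block table gets one reset. That appears consistent with the regression test, where a 200-row table still reports the `heap_reset_scan_snapshot_effective` injection point. With the default 8 kB block size, 4096 pages correspond to 32 MB of heap between resets.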
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..41a2d095d2c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -25,6 +25,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often?  If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest one is pushed as active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to let the xmin horizon keep advancing.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  Snapshots are then replaced
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..f4a62ed1ca7 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum
+REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..ba7bc0cc384 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -38,6 +38,7 @@ tests += {
       'hashagg',
       'reindex_conc',
       'vacuum',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.project_build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v22-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 12-v22-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From c7424f44a086433d2eff6153476e0fd0c6b5b576 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v22 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0
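
The pgbench scripts in 006_cic.pl above split clients into two roles: whoever wins pg_try_advisory_lock(42) may rebuild the index, everyone else upserts and calls setval('in_row_rebuild', 1). A minimal sketch of how that sequence gate appears to behave, written as a plain C model. SeqModel, seq_nextval, seq_setval and should_rebuild are hypothetical names used only for illustration; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for the in_row_rebuild sequence. */
typedef struct SeqModel
{
	long		last_value;
	bool		is_called;
} SeqModel;

/* Mirrors nextval(): the first call returns the start value unchanged. */
static long
seq_nextval(SeqModel *seq)
{
	assert(seq != NULL);
	if (!seq->is_called)
	{
		seq->is_called = true;
		return seq->last_value;
	}
	return ++seq->last_value;
}

/* Mirrors setval(seq, v): the next seq_nextval() returns v + 1. */
static void
seq_setval(SeqModel *seq, long value)
{
	assert(seq != NULL);
	seq->last_value = value;
	seq->is_called = true;
}

/* The "\if :last_value < 3" gate taken by the advisory-lock holder. */
static bool
should_rebuild(SeqModel *seq)
{
	return seq_nextval(seq) < 3;
}
```

Under this reading, the sequence starts at 1 and is consumed once during set-up, so the first lock holder sees 2 and rebuilds. A second lock holder with no upsert in between sees 3 and skips the rebuild, and any upsert re-arms the gate through setval. The effect would be that at least one upsert runs between consecutive CREATE/REINDEX cycles.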



  [application/octet-stream] v22-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 13-v22-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)
  download

  [application/octet-stream] v22-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 14-v22-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 38662117e7cf8e040715f429a90beae0605508f0 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v22 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for stability of the stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index f11c43a0ed7..3048e044aec 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6f753ab6d7a..bd291d05f68 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1790,6 +1790,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4229,7 +4230,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4308,6 +4309,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..0edf54e852d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 514eae1037d..8851f0fda06 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -486,6 +486,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -696,6 +738,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -706,23 +750,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 54da8e7995b..86c64477eae 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 59233b64730..0c720e450e9 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -716,12 +716,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -756,8 +758,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -769,30 +771,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint's index to extract its attributes and predicates.
+		 * We open indexes in the loop so that the lock order stays unchanged, avoiding deadlocks.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -815,7 +863,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider indexes that are indisready but not yet
+		 * indisvalid, because they may become indisvalid before the execution
+		 * phase. The set of indexes used as arbiters must be the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -835,27 +889,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* For ON CONFLICT ON CONSTRAINT ... DO UPDATE, skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -875,7 +925,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -883,6 +933,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -920,27 +974,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * With conventional inference, a partial index's predicate must be
+		 * implied by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* For a named constraint, the predicate must match that of the constraint's index. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -948,7 +1010,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required at planning time. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index ea35f30f494..ad440ff024c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -447,6 +448,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
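
For readers skimming the plancat.c hunks above, here is a minimal, self-contained sketch of the selection rule they appear to implement: candidates are filtered on indisready rather than indisvalid, and planning fails unless at least one selected index is indisvalid. IndexStub and select_arbiters are hypothetical stand-ins, not PostgreSQL APIs, and key matching is reduced to comparing a bitmask, which is an assumption made only to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the pg_index fields the patch inspects. */
typedef struct IndexStub
{
	unsigned	oid;
	bool		indisready;
	bool		indisvalid;
	bool		indisunique;
	unsigned	keymask;		/* simplified: one bit per indexed column */
} IndexStub;

/*
 * Collect the OIDs of ready, unique indexes whose key set equals
 * required_mask.  Returns the number of arbiters written to out[], or -1
 * when nothing matched or no match is indisvalid (the case in which the
 * patch raises an error).
 */
static int
select_arbiters(const IndexStub *idx, size_t n, unsigned required_mask,
				unsigned *out)
{
	int			count = 0;
	bool		found_valid = false;

	assert(out != NULL);

	for (size_t i = 0; i < n; i++)
	{
		if (!idx[i].indisready)
			continue;			/* not maintained by inserts yet */
		if (!idx[i].indisunique)
			continue;			/* cannot arbitrate conflicts */
		if (idx[i].keymask != required_mask)
			continue;			/* different key set */

		out[count++] = idx[i].oid;
		found_valid |= idx[i].indisvalid;
	}

	return (count > 0 && found_valid) ? count : -1;
}
```

With an original index and its ready-but-invalid copy from REINDEX CONCURRENTLY on the same columns, both OIDs are returned, so every concurrent transaction arbitrates on the same set. If only the not-yet-valid copy matches, the function reports failure.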



  [application/octet-stream] v22-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 15-v22-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  [application/octet-stream] v22-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (30.5K, 16-v22-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)
  [application/octet-stream] v22-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (94.9K, 17-v22-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)
  [application/octet-stream] v22-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (37.3K, 18-v22-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)
  [application/octet-stream] v22-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (19.0K, 19-v22-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)


* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-07-07 12:00                             ` Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Sergey Sargsyan @ 2025-07-07 12:00 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

I’ve tested the patch across several thousand test cases, and no faults of
any kind have been observed.
Additionally, I independently built a closed banking transaction system to
verify consistency during REINDEX CONCURRENTLY while multiple backends were
writing simultaneously. The results showed no missing transactions, and the
validation logic worked exactly as expected. On large tables, I observed a
significant speedup, often several times faster.

I believe this patch is highly valuable, as REINDEX CONCURRENTLY is a
common maintenance operation. I also noticed that there is a separate
thread working on adding support for concurrent reindexing of partitioned
indexes. Without this patch, that feature would likely suffer from serious
performance issues, since reindexing many indexes in one go would be both
time-consuming and lock-intensive.

Best regards,
S

On Thu, Jul 3, 2025, 3:24 AM Mihail Nikalayeu <[email protected]>
wrote:

> Hello!
>
> Rebased again, patch structure and comments available here [0]. Quick
> introduction poster - here [1].
>
> Best regards,
> Mikhail.
>
> [0]:
> https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
> [1]:
> https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf
>



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-07-10 14:30 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

I have added the patch to the offer book of the Review Marketplace [0].

Best regards,
Mikhail.

[0]: https://wiki.postgresql.org/wiki/Review_Marketplace#Offer_book






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-09-05 00:25 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased. The patch structure and comments are available here [0].
A quick introductory poster is here [1].

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v23-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 2-v23-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 77e720991eeb7c3bb9cb2228562bfa4802ea4a63 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v23 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)
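
As an illustration of the dependency tracking described in the commit message, here is a minimal sketch of the lookup that get_auxiliary_index() in this patch appears to perform: scan the dependency rows that reference the main index and return the first AUTO-dependent object that is itself an index. DepRow, DepType, INVALID_OID and find_auxiliary_index are hypothetical names introduced for this example. The real function scans pg_depend through a system table scan, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_OID 0u			/* stand-in for InvalidOid */

/* Hypothetical, simplified dependency types. */
typedef enum DepType
{
	DEP_NORMAL = 'n',
	DEP_AUTO = 'a'
} DepType;

/* Hypothetical, flattened view of one pg_depend row. */
typedef struct DepRow
{
	unsigned	objid;			/* dependent object */
	unsigned	refobjid;		/* referenced object */
	DepType		deptype;
	char		relkind;		/* relkind of objid: 'i' index, 'r' table */
} DepRow;

/*
 * Return the OID of the first index that has an AUTO dependency on
 * index_oid, or INVALID_OID if there is none.
 */
static unsigned
find_auxiliary_index(const DepRow *rows, size_t n, unsigned index_oid)
{
	assert(rows != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (rows[i].refobjid == index_oid &&
			rows[i].deptype == DEP_AUTO &&
			rows[i].relkind == 'i')
			return rows[i].objid;
	}

	return INVALID_OID;
}
```

Because the auxiliary index records an AUTO dependency on its main index, dropping the main index removes the auxiliary one as well. The same lookup lets REINDEX find and delete a junk auxiliary index before rebuilding.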

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 289969e581b..fe2f1ff236c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2633,7 +2646,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2694,7 +2708,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3869,6 +3884,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3941,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; if one exists, it must be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4242,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4332,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4381,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index held by a relation
+		 * of relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8c721e20992..7b3d4b19288 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3671,6 +3672,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4020,6 +4022,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4027,6 +4030,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4100,12 +4104,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4115,6 +4124,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4136,10 +4146,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4320,7 +4338,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4343,6 +4362,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4561,6 +4583,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4612,6 +4636,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 082a3575d62..71c3b993bd3 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop must be the first action in its transaction.
+		 * But if there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 17fcd6dd19f..40133e24b2d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -220,6 +220,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index a3e85ba1310..85cd088d080 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v23-0011-Refresh-snapshot-periodically-during-index-valid.patch (23.3K, 3-v23-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 7afecf46db5704d5f4ab4c4b5317b2134a604524 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v23 11/12] Refresh snapshot periodically during index
 validation

Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             |  7 +--
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 11 files changed, 146 insertions(+), 83 deletions(-)
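
Reviewer note (not part of the patch): the refresh loop described above can be sketched in isolation as follows. This is only a toy model under stated assumptions. Xid, ToySnapshotSource, take_snapshot(), xid_newer() and validate_scan_sketch() are hypothetical stand-ins invented for illustration; they are not PostgreSQL APIs. The real code uses RegisterSnapshot(GetLatestSnapshot()) and TransactionIdNewer(), and the toy source simply hands out an xmin that grows by one per snapshot.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Xid;				/* hypothetical stand-in for TransactionId */

/* Same interval as VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE in the patch */
#define REFRESH_EVERY_N_PAGES 4096

/* Toy snapshot source (assumption): each new snapshot has xmin one higher */
typedef struct ToySnapshotSource
{
	Xid			next_xmin;
	int			snapshots_taken;
} ToySnapshotSource;

static Xid
take_snapshot(ToySnapshotSource *src)
{
	src->snapshots_taken++;
	return src->next_xmin++;
}

/* Return the newer of two xids, comparing modulo 2^32 (wraparound-aware) */
static Xid
xid_newer(Xid a, Xid b)
{
	return ((int32_t) (a - b) > 0) ? a : b;
}

/*
 * Scan npages pages, replacing the snapshot every REFRESH_EVERY_N_PAGES
 * pages, and return the newest xmin seen.  The caller would later wait for
 * all transactions whose snapshots are older than that value.
 */
static Xid
validate_scan_sketch(ToySnapshotSource *src, int npages)
{
	int			page_read_counter = 1;	/* 1, so no refresh right at start */
	Xid			snapshot_xmin = take_snapshot(src);
	Xid			limit_xmin = snapshot_xmin;

	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked against the snapshot ... */

		if (page_read_counter % REFRESH_EVERY_N_PAGES == 0)
		{
			/* drop the old snapshot, take a new one, remember newest xmin */
			snapshot_xmin = take_snapshot(src);
			assert((int32_t) (snapshot_xmin - limit_xmin) >= 0);
			limit_xmin = xid_newer(limit_xmin, snapshot_xmin);
		}
	}

	return limit_xmin;
}
```

In this model a scan of 10000 pages takes three snapshots in total (the initial one plus refreshes when the counter reaches 4096 and 8192), so the backend never pins an xmin for longer than one 4096-page interval.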

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index cf14f474946..1626cee7a03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Because concurrent index builds use periodically refreshed snapshots and
+   auxiliary indexes, they have minimal impact on concurrent
+   <command>VACUUM</command> operations. The build advances its
+   transaction horizon as it proceeds, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only the snapshot
+   currently in use, which is replaced regularly, can temporarily hold back
+   concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3cac122f7a7..409852f23e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ee94ab509e7..4f936a6cd98 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -445,7 +445,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than the latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 8f8a1ad7796..d57485cefc2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6ec6d59538f..4b8ddd6c2ea 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3535,8 +3535,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3549,7 +3550,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3570,13 +3571,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3626,8 +3628,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3663,6 +3669,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3697,6 +3706,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3716,19 +3728,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3751,6 +3768,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b3d4b19288..95d9ba57324 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1793,32 +1792,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1840,8 +1818,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot of the validation phase was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4381,7 +4359,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4396,13 +4373,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4414,16 +4384,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4436,7 +4398,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6e280aa4e6a..c0aac2dab77 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,10 +708,9 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void 		(*index_validate_scan) (Relation table_rel,
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
 												Relation index_rel,
 												struct IndexInfo *index_info,
-												Snapshot snapshot,
 												struct ValidateIndexState *state,
 												struct ValidateIndexState *aux_state);
 
@@ -1825,18 +1824,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
 						  struct ValidateIndexState *state,
 						  struct ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0



  [application/octet-stream] v23-0010-Optimize-auxiliary-index-handling.patch (2.4K, 4-v23-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From d3ffc65be0c53870b0b4b3d3d014f7fe30f8dbaa Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v23 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fe2f1ff236c..6ec6d59538f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2933,6 +2933,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0edf54e852d..09b9b811def 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] v23-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.8K, 5-v23-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From d6cf068e7444f6ea9343eb939556d6957f726a4a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v23 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
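Note (not part of the commit message): the two-pointer merge step listed
above can be illustrated with a minimal standalone sketch. It is only an
illustration under stated assumptions, not the code from this patch: plain
sorted arrays stand in for the two tuplesorts and the tuplestore, and the
names encoded_tid and sorted_tid_difference are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a TID encoded as a sortable 64-bit integer. */
typedef uint64_t encoded_tid;

/*
 * Write to out[] every element of aux[] that is absent from main_tids[].
 *
 * Both inputs must be sorted in ascending order, and out[] must have room
 * for naux entries.  Returns the number of entries written.  This is the
 * set difference "aux minus main", computed in a single pass over both
 * inputs.
 */
static size_t
sorted_tid_difference(const encoded_tid *main_tids, size_t nmain,
					  const encoded_tid *aux, size_t naux,
					  encoded_tid *out)
{
	size_t		i = 0;			/* cursor into main_tids */
	size_t		nout = 0;		/* number of entries written to out */

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes aux[j]. */
		while (i < nmain && main_tids[i] < aux[j])
			i++;

		/*
		 * If the main input is exhausted, or its cursor is already past
		 * aux[j], then aux[j] does not appear in the main input.
		 */
		if (i == nmain || main_tids[i] > aux[j])
			out[nout++] = aux[j];
	}

	return nout;
}
```

Because each cursor only moves forward, the cost is linear in the combined
length of the two inputs once they are sorted.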
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 334 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  28 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1120 insertions(+), 343 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..5c48e529e4a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6327,6 +6327,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6367,13 +6379,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6390,8 +6401,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 42748c01a49..3cac122f7a7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in the
+ * main one are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as possible */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d0cabde8140..289969e581b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases it
+ *		should be equal to the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2632,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2693,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next, we build the auxiliary index. This is a fast operation that does not
+ * scan the table, so the result is an empty STIR index. We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate. From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3495,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but those tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3517,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes and do a "merge join" on the
+ * TID lists to see which tuples from the auxiliary index are missing from the
+ * target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that the bulkdelete passes must
+ * run in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3542,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions are performed to drop the auxiliary index concurrently.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3594,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3619,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		/*
+		 * Nothing to merge.  The reference snapshot is owned by the
+		 * caller, which also computes limitXmin from it, so there is
+		 * nothing more to do here.
+		 */
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3690,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3722,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3783,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 1b3c5a55882..3f80a9fa66e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1291,16 +1291,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dda1eb0e94c..8c721e20992 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but is marked as "ready-for-inserts". Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1721,9 +1771,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1741,24 +1810,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1785,7 +1844,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1810,6 +1869,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3564,6 +3670,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3669,8 +3776,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3722,8 +3836,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3784,6 +3905,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3887,15 +4015,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3946,6 +4077,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3959,12 +4095,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3973,6 +4114,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3991,10 +4133,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4075,13 +4221,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast because it is created empty, without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4124,6 +4313,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4131,12 +4355,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4174,7 +4392,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4203,7 +4421,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4293,14 +4511,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4325,6 +4543,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4338,11 +4578,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4362,6 +4602,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 570b1ed9f2f..6e280aa4e6a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -708,11 +708,12 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	void 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												Snapshot snapshot,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1820,19 +1821,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
 						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 98e68e972be..a3e85ba1310 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..ae3bfc3688e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2050,14 +2050,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
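
As a reading aid for the renumbering in progress.h and the matching CASE expression in rules.out above, the new phase numbers can be collected into one lookup table. This is only an illustrative, stand-alone sketch: createidx_phase_name() and createidx_phase_is_wait() are hypothetical helpers that do not exist in PostgreSQL, the strings are copied from the updated view definition, and the access-method subphase suffix that the view appends to 'building index' is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Phase numbers as defined in progress.h after this patch. */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/*
 * Hypothetical helper (not part of PostgreSQL): map a phase number to the
 * text shown by pg_stat_progress_create_index.  Returns NULL for unknown
 * values, mirroring the ELSE NULL branch of the view.
 */
static const char *
createidx_phase_name(int phase)
{
	static const char *const names[] = {
		[0] = "initializing",
		[PROGRESS_CREATEIDX_PHASE_WAIT_1] = "waiting for writers before build",
		[PROGRESS_CREATEIDX_PHASE_WAIT_2] = "waiting for writers to use auxiliary index",
		[PROGRESS_CREATEIDX_PHASE_BUILD] = "building index",
		[PROGRESS_CREATEIDX_PHASE_WAIT_3] = "waiting for writers before validation",
		[PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN] = "index validation: scanning index",
		[PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT] = "index validation: sorting tuples",
		[PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE] = "index validation: merging indexes",
		[PROGRESS_CREATEIDX_PHASE_WAIT_4] = "waiting for old snapshots",
		[PROGRESS_CREATEIDX_PHASE_WAIT_5] = "waiting for readers before marking dead",
		[PROGRESS_CREATEIDX_PHASE_WAIT_6] = "waiting for readers before dropping",
	};

	if (phase < 0 || (size_t) phase >= sizeof(names) / sizeof(names[0]))
		return NULL;
	return names[phase];
}

/* Hypothetical helper: true if the phase is one of the waiting phases. */
static bool
createidx_phase_is_wait(int phase)
{
	const char *name = createidx_phase_name(phase);

	return name != NULL && strncmp(name, "waiting", 7) == 0;
}
```

Note that the old "index validation: scanning table" phase has no counterpart here: phase 7 now reports the merge of the auxiliary index into the target index, and a new waiting phase (2) appears before the build.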



  [application/octet-stream] v23-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.2K, 6-v23-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From e4e667312cc6cee3122be2ef92f316f2dfa75ccb Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v23 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization let concurrent index builds skip waiting for other backends running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY on indexes without expressions or predicates. With the new snapshot handling approach, which periodically refreshes snapshots, the optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- dropping PROC_IN_SAFE_IC from the flags passed to GetCurrentVirtualXIDs()
- removing related test cases and injection points
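
For illustration only, the effect of dropping the flag on the wait logic can be modelled as a bitmask test. This is a simplified, stand-alone sketch and not PostgreSQL code: waiter_considers_backend(), WAIT_EXCLUDE_OLD and WAIT_EXCLUDE_NEW are hypothetical names, and the real GetCurrentVirtualXIDs() applies further filters (xmin limit, database, and so on) that are ignored here. The flag values are the ones shown in proc.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in src/include/storage/proc.h. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02
#define PROC_IN_SAFE_IC		0x04	/* removed by this patch */

/* Hypothetical names for the masks used by WaitForOlderSnapshots(). */
#define WAIT_EXCLUDE_OLD	(PROC_IS_AUTOVACUUM | PROC_IN_VACUUM | PROC_IN_SAFE_IC)
#define WAIT_EXCLUDE_NEW	(PROC_IS_AUTOVACUUM | PROC_IN_VACUUM)

/*
 * Hypothetical helper: a backend is a candidate to wait for only if none of
 * its status flags are in the exclusion mask.
 */
static bool
waiter_considers_backend(uint8_t status_flags, uint8_t exclude_flags)
{
	return (status_flags & exclude_flags) == 0;
}
```

Under this model, a backend that would previously have advertised PROC_IN_SAFE_IC was skipped with the old mask and is considered with the new one, while vacuum backends are skipped either way.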
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5554cfa6f4d..cebcb777ef3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bf26106aa5e..829ecb4ed41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 4f936a6cd98..f4ea4cce04d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set on a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 95d9ba57324..2480c6e8cf0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -114,7 +114,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -417,10 +416,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -441,8 +437,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -462,8 +457,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -577,7 +571,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1670,10 +1659,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1728,9 +1713,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1760,10 +1742,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1789,9 +1767,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1808,9 +1784,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1851,10 +1824,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1875,10 +1844,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3653,7 +3618,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4027,17 +3991,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4103,7 +4056,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4204,11 +4156,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4240,10 +4187,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4252,11 +4195,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4281,10 +4219,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4304,11 +4238,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4330,10 +4259,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4369,10 +4294,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4400,9 +4321,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4424,13 +4342,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4486,12 +4397,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4555,12 +4460,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4828,36 +4727,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..f47d268d6c7 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f4a62ed1ca7..b217b1aa951 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
+REGRESS = injection_points hashagg vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index ba7bc0cc384..7feaf05129c 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'vacuum',
       'cic_reset_snapshots',
     ],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v23-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 7-v23-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From 720ae26f8d7dd42671e14bf6ddcc2cdcce28f600 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v23 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying that technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with scan
- relax the restriction on parallel workers calling GetLatestSnapshot
---
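Note (not part of the commit message): the "track snapshot restoration"
item above can be pictured with the following minimal, single-process
sketch. It only illustrates the idea of one boolean flag per worker that
the leader checks before it starts resetting snapshots. WorkerSlot,
init_snapshot_flags, mark_snapshot_restored and all_snapshots_restored
are hypothetical names introduced for illustration; they are not
functions from this patch or from PostgreSQL. Real shared-memory code
would also need proper synchronization, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state kept by the leader. */
typedef struct WorkerSlot
{
	bool		launched;			/* was this worker actually started? */
	bool	   *snapshot_restored;	/* points into the shared flag array */
} WorkerSlot;

/* Leader side: point each slot at its own flag and clear the flag. */
static void
init_snapshot_flags(WorkerSlot *slots, bool *flags, size_t nworkers)
{
	for (size_t i = 0; i < nworkers; i++)
	{
		slots[i].launched = false;
		slots[i].snapshot_restored = &flags[i];
		*slots[i].snapshot_restored = false;
	}
}

/* Worker side: report that the initial snapshot has been restored. */
static void
mark_snapshot_restored(WorkerSlot *slot)
{
	assert(slot->snapshot_restored != NULL);
	*slot->snapshot_restored = true;
}

/* Leader side: true once every launched worker has reported. */
static bool
all_snapshots_restored(const WorkerSlot *slots, size_t nworkers)
{
	for (size_t i = 0; i < nworkers; i++)
	{
		if (slots[i].launched && !*slots[i].snapshot_restored)
			return false;
	}
	return true;
}
```

In this sketch, workers that were never launched are ignored, so a
failure to start a worker cannot make the leader wait forever.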
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 186edd0d229..5554cfa6f4d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need
+	 * to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2f947d36619..bf26106aa5e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need
+	 * to wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e32ee739733..a7e16871af6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7b09ad878b7..53b7ddfff0e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -322,22 +322,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -486,8 +484,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent non-unique build, snapshots are reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index a56c5eceb14..6f04c365994 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan
+	 * accordingly and use the active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: track whether each parallel worker has successfully restored
+ * its snapshot. This is needed when using periodic snapshot resets to ensure all
+ * workers have a valid initial snapshot before proceeding with the scan.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag to let the leader know about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 248a39c164b..7e4560d0f35 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..f16284d4d0d 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -371,7 +371,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8e1a918f130..68ea98405bb 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -353,14 +353,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 41a2d095d2c..fc3b551e8e9 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1135,7 +1135,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1753,9 +1754,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan. Snapshots are then changed on the fly so that the xmin horizon
+ * can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v23-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.1K, 8-v23-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 163d8797c6d0cbaf60d75e7b1cabb96359964ecd Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v23 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to mitigate this problem by allowing VACUUM to ignore such snapshots. However, it was reverted in commit e28bb8851969 because indexes could miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch introduces an alternative by periodically resetting the snapshot used during the first phase. By resetting the snapshot every N pages during the heap scan, it allows the xmin horizon to advance.

Currently, this technique is limited to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option, SO_RESET_SNAPSHOT, is introduced. When set, it causes the snapshot to be reset between pages, once every SO_RESET_SNAPSHOT_EACH_N_PAGE pages of the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds, that is, builds that are not on system catalogs and do not use parallel workers.
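
As a rough illustration of the reset cadence described above, here is a minimal standalone sketch. It is not the patch's code: ToyScan, toy_should_reset and toy_scan_blocks are hypothetical names, the interval is a plain parameter rather than the real SO_RESET_SNAPSHOT_EACH_N_PAGE value, and "taking a fresh snapshot" is modelled as bumping a counter. It only mirrors the modulo check on the current block number that the heap scan performs after fetching each buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the scan state; these are not PostgreSQL types. */
typedef struct ToyScan
{
	uint32_t	reset_every;	/* plays the role of SO_RESET_SNAPSHOT_EACH_N_PAGE */
	bool		reset_enabled;	/* plays the role of the SO_RESET_SNAPSHOT flag */
	uint64_t	snapshot_id;	/* bumped each time a "fresh snapshot" is taken */
} ToyScan;

/* A reset is due when the option is set and blkno is a multiple of the interval. */
static bool
toy_should_reset(const ToyScan *scan, uint32_t blkno)
{
	assert(scan->reset_every > 0);
	return scan->reset_enabled && (blkno % scan->reset_every == 0);
}

/* Walk blocks 0 .. nblocks-1 in order and return how many resets happened. */
static uint64_t
toy_scan_blocks(ToyScan *scan, uint32_t nblocks)
{
	uint64_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (toy_should_reset(scan, blkno))
		{
			/* In the patch this is: pop old snapshot, push a fresh one. */
			scan->snapshot_id++;
			resets++;
		}
	}
	return resets;
}
```

With an interval of 4 and 10 blocks, the sketch resets at blocks 0, 4 and 8. Between two resets the backend holds no snapshot at all, which is what would let the xmin horizon move forward.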
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2445f001700..25a32a13565 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index b5de68b7232..331b4f2b916 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7ff7467e462..186edd0d229 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e9d4b27427e..2f947d36619 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9b2ec9815f1..bfc27474433 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e3e7307ef5f..ea8e95b3e86 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -633,6 +634,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -674,7 +705,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1336,6 +1372,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..e32ee739733 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan requires some additional infrastructure to run
+	 * with SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, which may be a copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);	/* no snapshot reset */
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..7b09ad878b7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -322,18 +322,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -481,6 +485,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -536,7 +543,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -558,18 +565,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c4029a4f3d3..248a39c164b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);	/* no snapshot reset */
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to propagate the xmin horizon forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b10429c3721..a7994652ead 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,23 +1693,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible in a single snapshot or in
+	 * a series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4106,9 +4100,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4123,7 +4114,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 41bd8353430..2a25bb0654a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -63,6 +63,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6927,6 +6928,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6982,6 +6984,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7039,6 +7046,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index a2bd5a897f8..0ef0957a627 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 struct TupleTableSlot;
 struct VacuumCutoffs;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1c9e802a6b1..41a2d095d2c 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -25,6 +25,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -62,6 +63,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset the scan and catalog snapshots every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -893,7 +905,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -901,6 +914,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1730,6 +1752,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  The snapshot is then replaced
+ * on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..f4a62ed1ca7 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum
+REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..ba7bc0cc384 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -38,6 +38,7 @@ tests += {
       'hashagg',
       'reindex_conc',
       'vacuum',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.project_build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0
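
For readers following the v16-0009 patch above, the condition under which heapam_index_build_range_scan asks for snapshot resets can be looked at in isolation. The sketch below is not PostgreSQL code: BuildFlags and wants_snapshot_reset are hypothetical names, and the function only mirrors the boolean expression that the patch assigns to reset_snapshots. Parallel builds take a different path in the patch and are not modelled here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three inputs the patch consults. */
typedef struct BuildFlags
{
	bool		concurrent;			/* models indexInfo->ii_Concurrent */
	bool		unique;				/* models indexInfo->ii_Unique */
	bool		is_system_catalog;	/* models the local is_system_catalog */
} BuildFlags;

/*
 * Mirrors the reset_snapshots expression from the patch: only a
 * concurrent, non-unique build on a non-catalog table asks the scan
 * to replace its snapshot periodically.
 */
static bool
wants_snapshot_reset(BuildFlags f)
{
	return f.concurrent && !f.unique && !f.is_system_catalog;
}
```

Under this model a plain CREATE INDEX CONCURRENTLY qualifies and CREATE UNIQUE INDEX CONCURRENTLY does not. That agrees with the serial half of the expected regression output above, where the unique index produces no injection-point notices and the non-unique one does.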



  [application/octet-stream] v23-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 9-v23-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 5db35a4b8bbd3442d20eda8e8a1f445a22cf2fdf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v23 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary use

STIR is designed as an ephemeral helper for concurrent index builds: it temporarily stores TIDs without providing the full feature set of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in subsequent commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 331b4f2b916..d3451078176 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 932701d8420..c25aac12f43 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3081,6 +3081,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3132,6 +3133,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e721389afa4..d0cabde8140 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8ea2913d906..385a1a926a8 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 5b2ab181b5f..b99916edb4a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -73,6 +73,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 118d6da1ace..d5ae0246c90 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index de782014b2d..17fcd6dd19f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -157,8 +157,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -218,7 +218,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index a79325e8a2f..8e7c9de12bb 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5139,7 +5139,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5153,7 +5154,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5178,9 +5180,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5189,12 +5191,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5203,7 +5206,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v23-0007-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 10-v23-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 54857bae3cd14cc4a99ad0eb6113d27f2c45ae72 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v23 07/12] Add Datum storage support to tuplestore

 Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)
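
The on-tape layouts listed in the commit message above can be illustrated with a small standalone sketch. It is not the patch's code: `Tape`, `put_fixed`, `get_fixed`, `put_varlen` and `get_varlen` are hypothetical names, the "tape" is a plain in-memory buffer rather than a BufFile, and the length word is assumed to count only the payload bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the tuplestore's temp file. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/* Fixed-length value: raw bytes only; the reader already knows typlen. */
static void
put_fixed(Tape *t, const void *val, int typlen)
{
	tape_write(t, val, (size_t) typlen);
}

static void
get_fixed(Tape *t, void *val, int typlen)
{
	tape_read(t, val, (size_t) typlen);
}

/*
 * Variable-length value: leading length word, then the payload, then a
 * trailing length word only when backward scans are requested.
 */
static void
put_varlen(Tape *t, const void *val, unsigned int len, int backward)
{
	tape_write(t, &len, sizeof(len));
	tape_write(t, val, len);
	if (backward)
		tape_write(t, &len, sizeof(len));
}

/* Reads one variable-length value and returns its payload length. */
static unsigned int
get_varlen(Tape *t, void *val, unsigned int maxlen, int backward)
{
	unsigned int len;

	tape_read(t, &len, sizeof(len));
	assert(len <= maxlen);
	tape_read(t, val, len);
	if (backward)
	{
		unsigned int trailer;

		tape_read(t, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

A 6-byte fixed-length value (the size of a TID) therefore occupies exactly 6 bytes in this sketch, while a 3-byte variable-length value written with backward-scan support occupies 3 bytes plus two length words.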

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the Datums to be stored */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple from tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * In the case of a Datum with constant length, both length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = (unsigned int) state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = (unsigned int) datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v23-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.7K, 11-v23-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 189de13acf7b537686f928a25170184623d4277c Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v23 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring tuples that are dead according to SnapshotSelf during uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values; this is what guarantees correctness

Tuples are tested against SnapshotSelf only when their index key values are equal; otherwise _bt_load works as before.
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  71 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 266 insertions(+), 94 deletions(-)
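
The second check described in the commit message above (fail only when several alive tuples share a key, and test liveness only for equal keys) can be sketched in isolation. This is a simplified model, not the patch's _bt_load code: `SortedEntry`, `check_unique_sorted` and the precomputed `alive` flag are hypothetical, with the flag standing in for a SnapshotSelf visibility test, and integer keys standing in for index key values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SortedEntry
{
	int			key;			/* stand-in for the index key values */
	bool		alive;			/* stand-in for a SnapshotSelf visibility test */
} SortedEntry;

/*
 * Walk entries sorted by key and report whether any key has more than one
 * alive entry.  Liveness is consulted only when two neighbours have equal
 * keys; *liveness_tests counts how many such lookups were made.
 */
static bool
check_unique_sorted(const SortedEntry *entries, size_t n,
					size_t *liveness_tests)
{
	bool		run_has_alive = false;	/* alive entry seen in this key run? */
	bool		run_first_tested = false;	/* first entry of run tested? */
	size_t		i;

	*liveness_tests = 0;
	for (i = 1; i < n; i++)
	{
		if (entries[i].key != entries[i - 1].key)
		{
			/* New key: start a new run, no visibility test needed. */
			run_has_alive = false;
			run_first_tested = false;
			continue;
		}

		/* Equal keys: only now pay for visibility tests. */
		if (!run_first_tested)
		{
			(*liveness_tests)++;
			run_has_alive = entries[i - 1].alive;
			run_first_tested = true;
		}

		(*liveness_tests)++;
		if (entries[i].alive)
		{
			if (run_has_alive)
				return false;	/* two alive tuples with the same key */
			run_has_alive = true;
		}
	}
	return true;
}
```

With all-distinct keys the function performs no liveness tests at all, which mirrors the commit message's statement that the extra work happens only for equal key values.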

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a7e16871af6..42748c01a49 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index ab0b6946cb0..9a9ee55ff1b 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -149,7 +149,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -375,7 +375,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -790,12 +790,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 53b7ddfff0e..ee94ab509e7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -84,6 +84,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -102,6 +103,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -204,15 +206,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and build non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -304,8 +304,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -322,20 +320,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -382,6 +380,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * This is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -430,8 +433,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -439,8 +443,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In case of a concurrent build, dead tuples do not need to be put into the
+	 * index, since we wait for all snapshots older than the reference snapshot
+	 * during the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -471,7 +479,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -484,7 +492,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -540,7 +548,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -562,7 +570,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -576,7 +584,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1155,13 +1163,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no two
+	 * alive tuples with equal values. This may happen if alive tuples are
+	 * separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index b88c396195a..ed5425ac6ec 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -688,7 +688,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -719,7 +719,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -968,7 +968,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -989,7 +989,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1028,7 +1028,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1150,7 +1150,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index edfea2acaff..f14f7b6d1cb 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -67,8 +67,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2422,7 +2420,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3757,7 +3755,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3875,17 +3873,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build when _bt_keep_natts_fast is not suitable because the
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if any compared attribute is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -3911,6 +3916,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -3930,7 +3937,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -3941,7 +3948,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as a comparison function during a concurrent
+ * build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -3950,6 +3958,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, *hasnulls is set to true if any compared attribute is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -3958,7 +3968,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -3975,6 +3986,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 | isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7e4560d0f35..e721389afa4 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a7994652ead..dda1eb0e94c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,8 +1693,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 890cdbe1204..1ce2e2ad63c 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -24,6 +24,8 @@
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "executor/executor.h"
@@ -33,6 +35,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -134,6 +137,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -359,6 +363,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -401,6 +406,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1654,6 +1660,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1663,18 +1670,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ab467cb8fd..0c9f0e1f3a6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1340,8 +1340,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fc3b551e8e9..570b1ed9f2f 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1754,9 +1754,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * That leads to changing snapshots on the fly to allow the xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0



  [application/octet-stream] v23-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 12-v23-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From 10bb4037396abf064f043b2e362a4ee0ed385c8d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v23 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for the stability of the stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)
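
The arbiter compatibility rule added to execPartition.c in this patch can be
illustrated with a small standalone sketch. It is not the patch's code: the
IndexSketch struct, MAX_KEYS, same_text() and index_compatible_as_arbiter()
are hypothetical names, plain strings stand in for the expression and
predicate lists, and simple equality replaces the patch's list_difference()
checks. It only shows the order of the comparisons under those assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_KEYS 8				/* hypothetical limit, for the sketch only */

/* Hypothetical, simplified stand-in for the fields the patch compares. */
typedef struct IndexSketch
{
	bool		is_unique;		/* mirrors ii_Unique */
	bool		is_exclusion;	/* true when ii_ExclusionOps would be non-NULL */
	int			nkeyatts;		/* mirrors indnkeyatts */
	int			keys[MAX_KEYS]; /* mirrors indkey.values[] */
	const char *exprs;			/* stand-in for index expressions, may be NULL */
	const char *pred;			/* stand-in for index predicate, may be NULL */
} IndexSketch;

/* Two optional strings match if both are NULL or both have equal contents. */
bool
same_text(const char *a, const char *b)
{
	if (a == NULL || b == NULL)
		return a == b;
	return strcmp(a, b) == 0;
}

/*
 * Returns true if the candidate could act as an arbiter alongside the given
 * arbiter index: same uniqueness, no exclusion constraints, same key columns,
 * same expressions and same predicate.
 */
bool
index_compatible_as_arbiter(const IndexSketch *arbiter,
							const IndexSketch *candidate)
{
	assert(arbiter != NULL && candidate != NULL);

	if (arbiter->is_unique != candidate->is_unique)
		return false;
	if (arbiter->is_exclusion || candidate->is_exclusion)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keys[i] != candidate->keys[i])
			return false;
	}

	if (!same_text(arbiter->exprs, candidate->exprs))
		return false;
	if (!same_text(arbiter->pred, candidate->pred))
		return false;
	return true;
}
```

In this sketch an index rebuilt by REINDEX CONCURRENTLY, which copies the
definition of the original, compares as compatible, while an index with a
different predicate or key column does not.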

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca2bde62e82..b10429c3721 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..0edf54e852d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 514eae1037d..8851f0fda06 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -486,6 +486,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* This check is not supported for exclusion constraints. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttNo != attNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -696,6 +738,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -706,23 +750,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7c6c2c1f6e4..f0917f3d907 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 4536bdd6cb4..778da296fd5 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -802,12 +802,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -842,8 +844,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -855,30 +857,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements for other indexes to be used as arbiters
+					 * together with indexOidFromConstraint. Both equal indexes must be
+					 * involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -901,7 +949,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -921,27 +975,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT ... DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -961,7 +1011,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -969,6 +1019,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1006,27 +1060,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * When conventional inference is involved, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1034,7 +1096,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0



  [application/octet-stream] v23-only-part-3-0005-Optimize-auxiliary-index-handling.patch_ (2.4K, 13-v23-only-part-3-0005-Optimize-auxiliary-index-handling.patch_)
  download

  [application/octet-stream] v23-only-part-3-0006-Refresh-snapshot-periodically-during.patch_ (20.7K, 14-v23-only-part-3-0006-Refresh-snapshot-periodically-during.patch_)
  download

  [application/octet-stream] v23-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_ (30.5K, 15-v23-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch_)
  download

  [application/octet-stream] v23-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 16-v23-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 603c039ac4bb9c3ec1993b372c105423120be952 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v23 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v23-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_ (94.9K, 17-v23-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch_)
  download

  [application/octet-stream] v23-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_ (19.0K, 18-v23-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch_)
  download

  [application/octet-stream] v23-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_ (37.3K, 19-v23-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch_)
  download

^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-09-28 09:26                                   ` Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-09-28 09:26 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello, everyone!

Rebased. The patch structure and comments are available here [0].
A quick introduction poster is here [1].

[0]: https://www.postgresql.org/message-id/flat/CADzfLwVOcZ9mg8gOG%2BKXWurt%3DMHRcqNv3XSECYoXyM3ENrxyfQ%4...
[1]: https://www.postgresql.org/message-id/attachment/176651/STIR-poster.pdf


Best regards,
Mikhail.


Attachments:

  [application/octet-stream] nocfbot-v24-only-part-3-0005-Optimize-auxiliary-index-handling.patch (2.4K, 2-nocfbot-v24-only-part-3-0005-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 1b1ce7286c172070b4a1d0d58522f98ee7b0e489 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v24-only-part-3 5/6] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
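
The two short-circuits described above can be illustrated outside the backend with a minimal sketch. Everything here is a hypothetical stand-in (MockIndexInfo, compute_attr, mock_index_unchanged_by_update are not PostgreSQL names); it only mirrors the idea that an auxiliary index gets all-NULL values and never triggers the "unchanged by update" check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for Datum and IndexInfo. */
typedef uintptr_t Datum;

typedef struct MockIndexInfo
{
	int		num_attrs;	/* plays the role of ii_NumIndexAttrs */
	bool	auxiliary;	/* plays the role of ii_Auxiliary */
} MockIndexInfo;

static int	expensive_calls = 0;	/* counts per-attribute computations */
static int	unchanged_checks = 0;	/* counts "index unchanged" checks */

/* Pretend this is a costly column fetch or expression evaluation. */
static Datum
compute_attr(int attno)
{
	expensive_calls++;
	return (Datum) (attno * 10);
}

/* Stand-in for the comparatively expensive index_unchanged_by_update(). */
static bool
mock_index_unchanged_by_update(void)
{
	unchanged_checks++;
	return true;
}

/* Fill values/isnull; an auxiliary index gets all NULLs without any work. */
static void
form_index_datum(const MockIndexInfo *info, Datum *values, bool *isnull)
{
	assert(info != NULL);

	if (info->auxiliary)
	{
		for (int i = 0; i < info->num_attrs; i++)
		{
			values[i] = (Datum) 0;
			isnull[i] = true;
		}
		return;
	}

	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = compute_attr(i + 1);
		isnull[i] = false;
	}
}

/* && short-circuits, so the check never runs for an auxiliary index. */
static bool
index_unchanged_hint(const MockIndexInfo *info, bool update)
{
	return update && !info->auxiliary && mock_index_unchanged_by_update();
}
```

Calling form_index_datum() and index_unchanged_hint() with auxiliary set to true should leave both counters untouched, which is the saving this commit is after.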
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9978192f2d8..976e4d6e980 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2916,6 +2916,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0edf54e852d..09b9b811def 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [application/octet-stream] nocfbot-v24-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch (37.3K, 3-nocfbot-v24-only-part-3-0001-Add-STIR-access-method-and-flags-rel.patch)
  download | inline diff:
From 975c1c14e37dabb60cad283c58a446d27e4b19d2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v24-only-part-3 1/6] Add STIR access method and flags related
 to auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method is introduced, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
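
To make the "temporarily storing TIDs" idea concrete, here is a minimal, single-threaded sketch of an append-only TID store with a metapage-like header. All names (MockTid, MockPage, MockStir, mock_stir_insert) and the tiny page capacity are assumptions for illustration; there is no buffer locking, and unlike stirinsert, which always returns false, the mock reports whether the TID was stored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MOCK_TUPLES_PER_PAGE 4	/* tiny on purpose; real pages are BLCKSZ */
#define MOCK_MAX_PAGES 8

/* Hypothetical stand-in for ItemPointerData. */
typedef struct MockTid
{
	uint32_t	block;
	uint16_t	offset;
} MockTid;

typedef struct MockPage
{
	int			ntuples;	/* plays the role of opaque->maxoff */
	MockTid		tuples[MOCK_TUPLES_PER_PAGE];
} MockPage;

typedef struct MockStir
{
	bool		skip_inserts;	/* metapage flag: ignore further inserts */
	int			last_page;		/* metapage: last data page, -1 if none */
	MockPage	pages[MOCK_MAX_PAGES];
} MockStir;

static void
mock_stir_init(MockStir *idx)
{
	memset(idx, 0, sizeof(*idx));
	idx->last_page = -1;
}

/* Append a TID to a page; returns false if the page is full. */
static bool
mock_page_add(MockPage *page, MockTid tid)
{
	if (page->ntuples >= MOCK_TUPLES_PER_PAGE)
		return false;
	page->tuples[page->ntuples++] = tid;
	return true;
}

/* Try the last page first; extend by one page only when it is full. */
static bool
mock_stir_insert(MockStir *idx, MockTid tid)
{
	assert(idx != NULL);

	if (idx->skip_inserts)
		return false;
	if (idx->last_page >= 0 &&
		mock_page_add(&idx->pages[idx->last_page], tid))
		return true;
	if (idx->last_page + 1 >= MOCK_MAX_PAGES)
		return false;			/* mock-only limit */
	idx->last_page++;
	return mock_page_add(&idx->pages[idx->last_page], tid);
}

/* Total number of stored TIDs, as a validation pass would visit them. */
static int
mock_stir_count(const MockStir *idx)
{
	int		total = 0;

	for (int i = 0; i <= idx->last_page; i++)
		total += idx->pages[i].ntuples;
	return total;
}
```

In the patch itself the "try the last page, then extend" step is additionally guarded by re-reading lastBlkNo under an exclusive metapage lock, so that concurrent backends retry instead of extending twice; the sketch leaves that out.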
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index b5de68b7232..6bfd190605b 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 981d9380a92..d0276bf483b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3087,6 +3087,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3138,6 +3139,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..8e509a51c11 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 12b4f3fd36e..b747c6e7804 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ac62f6a6abd..0d0a0f8d73f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,6 +75,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 01eba3b5a19..0d29115f200 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..7263c5e29a9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is this an auxiliary index for a concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index a79325e8a2f..8e7c9de12bb 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5139,7 +5139,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5153,7 +5154,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5178,9 +5180,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5189,12 +5191,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5203,7 +5206,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
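
A note on the page layout in stir.h above: StirPageGetFreeSpace subtracts the aligned page header, the aligned opaque area, and one fixed-size StirTuple per stored TID from the block size, so each data page has a fixed capacity. The sketch below works through that arithmetic. It is not code from the patch: all SKETCH_ names are hypothetical, and the sizes (8192-byte blocks, a 24-byte page header, 6-byte TIDs, 8-byte alignment) are assumptions about a default build.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for a default build; none of these come from the patch. */
#define SKETCH_BLCKSZ			8192	/* assumed default block size */
#define SKETCH_PAGE_HEADER		24		/* assumed SizeOfPageHeaderData */
#define SKETCH_OPAQUE			8		/* four uint16 fields, as in StirPageOpaqueData */
#define SKETCH_TUPLE			6		/* assumed sizeof(ItemPointerData) */
#define SKETCH_MAXALIGN(len)	(((size_t) (len) + 7) & ~((size_t) 7))

/* Mirrors the shape of StirPageGetFreeSpace for a page holding maxoff tuples. */
static size_t
sketch_stir_free_space(size_t maxoff)
{
	size_t		usable = SKETCH_BLCKSZ
		- SKETCH_MAXALIGN(SKETCH_PAGE_HEADER)
		- SKETCH_MAXALIGN(SKETCH_OPAQUE);

	/* a page can never hold more tuples than fit in the usable area */
	assert(maxoff * SKETCH_TUPLE <= usable);
	return usable - maxoff * SKETCH_TUPLE;
}

/* Number of TIDs that fit on one empty data page. */
static size_t
sketch_stir_page_capacity(void)
{
	return sketch_stir_free_space(0) / SKETCH_TUPLE;
}
```

Under these assumed sizes an empty page has 8160 usable bytes, which is room for 1360 TIDs. A build with a different block size or alignment would give different numbers.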



  [application/octet-stream] nocfbot-v24-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch (94.4K, 4-nocfbot-v24-only-part-3-0003-Use-auxiliary-indexes-for-concurrent.patch)
  download | inline diff:
From 78de6899b16c992b8f8a35c76651d607b0733889 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v24-only-part-3 3/6] Use auxiliary indexes for concurrent
 index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 544 ++++++++++++++-------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1122 insertions(+), 334 deletions(-)
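
The two-pointer step described in the commit message above can be sketched in isolation. This is a minimal illustration, not the patch's code: it uses plain sorted arrays of int64-encoded TIDs instead of tuplesorts and a tuplestore, and tid_encode and tid_difference are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block, offset) packed so that
 * integer order matches (block, offset) order. */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((int64_t) block << 16) | offset;
}

/*
 * Copy into out[] every TID present in aux[] but absent from mainidx[].
 * Both inputs must be sorted ascending; out[] must have room for naux
 * entries.  Returns the number of TIDs written.
 */
static size_t
tid_difference(const int64_t *mainidx, size_t nmain,
			   const int64_t *aux, size_t naux,
			   int64_t *out)
{
	size_t		i = 0;			/* cursor into mainidx[] */
	size_t		n = 0;			/* number of TIDs written to out[] */

	assert(out != NULL);

	for (size_t j = 0; j < naux; j++)
	{
		/* skip main entries that sort before the current aux TID */
		while (i < nmain && mainidx[i] < aux[j])
			i++;

		/* main is exhausted or holds a larger TID: aux[j] is missing */
		if (i >= nmain || mainidx[i] != aux[j])
			out[n++] = aux[j];
	}
	return n;
}
```

Each input is read once, so the pass is linear in the combined number of TIDs. In this sketch a TID repeated in aux[] is emitted once per occurrence when it is missing from mainidx[]; whether and how the patch handles duplicates is not shown here.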

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..5c48e529e4a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6327,6 +6327,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that the auxiliary index is used for new tuples in
+       future transactions.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6367,13 +6379,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6390,8 +6401,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..28e2a1604c4 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the target
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..c85e5332ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1743,242 +1744,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation, store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e509a51c11..d4ac7a0e606 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception here.
  *
  * Returns the OID of the created index.
  */
@@ -759,6 +764,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +801,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1411,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1472,6 +1487,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2452,7 +2615,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2512,7 +2676,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3288,12 +3453,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that we build the auxiliary index.  It is a fast operation without
+ * any actual table scan; as a result, we have an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3303,14 +3477,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3495,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that bulkdelete must be done in a
+ * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3520,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some steps to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index and
+	 * 10% for sorting the auxiliary index.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,6 +3572,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3413,15 +3597,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3668,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,6 +3700,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3533,6 +3761,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4037,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4286,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4312,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c77fa0234bb..88d94e7ced9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1291,16 +1291,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b10429c3721..0c7740e2c85 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,44 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index.  It will be created empty, without
+	 * any actual heap scan, but marked as "ready-for-inserts".  The goal of
+	 * that index is to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1727,9 +1783,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1747,24 +1822,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge contents of auxiliary and target indexes: insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1791,7 +1856,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1816,6 +1881,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3570,6 +3682,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3675,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3728,8 +3848,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3790,6 +3917,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3893,15 +4027,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3952,6 +4089,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3965,12 +4107,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3979,6 +4126,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3997,10 +4145,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4081,13 +4233,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast, since it is created empty without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4134,6 +4333,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4141,12 +4375,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4184,7 +4412,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4213,7 +4441,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4303,14 +4531,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4335,6 +4563,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4348,11 +4598,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4372,6 +4622,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..22446b32157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -706,7 +706,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1803,19 +1804,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 98e68e972be..a3e85ba1310 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..ae3bfc3688e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2050,14 +2050,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
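
The progress.h and rules.out hunks above renumber the CREATE INDEX phases to make room for the auxiliary-index wait and the merge step. As a reading aid, the resulting mapping can be sketched as a small lookup. The constant values and phase strings are copied from the hunks above; the helper name createidx_phase_name() is hypothetical and is not part of the patch, and the AM-specific subphase suffix that the view appends to "building index" is left out.

```c
#include <stddef.h>

/* Phase numbers as they stand after this patch (commands/progress.h). */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/*
 * Hypothetical helper (illustration only): returns the phase text that
 * pg_stat_progress_create_index shows for a phase number, or NULL for an
 * unknown phase.  The view additionally appends an AM-specific subphase
 * name to "building index"; that part is omitted here.
 */
static const char *
createidx_phase_name(int phase)
{
	switch (phase)
	{
		case 0:
			return "initializing";
		case PROGRESS_CREATEIDX_PHASE_WAIT_1:
			return "waiting for writers before build";
		case PROGRESS_CREATEIDX_PHASE_WAIT_2:
			return "waiting for writers to use auxiliary index";
		case PROGRESS_CREATEIDX_PHASE_BUILD:
			return "building index";
		case PROGRESS_CREATEIDX_PHASE_WAIT_3:
			return "waiting for writers before validation";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN:
			return "index validation: scanning index";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT:
			return "index validation: sorting tuples";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE:
			return "index validation: merging indexes";
		case PROGRESS_CREATEIDX_PHASE_WAIT_4:
			return "waiting for old snapshots";
		case PROGRESS_CREATEIDX_PHASE_WAIT_5:
			return "waiting for readers before marking dead";
		case PROGRESS_CREATEIDX_PHASE_WAIT_6:
			return "waiting for readers before dropping";
		default:
			return NULL;
	}
}
```

Note that monitoring tools that compare the raw phase number instead of the phase text would see different meanings for values 2 through 9 after this change.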



  [application/octet-stream] nocfbot-v24-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch (19.0K, 5-nocfbot-v24-only-part-3-0002-Add-Datum-storage-support-to-tuplest.patch)
  download | inline diff:
From 3ddbaf9289a60d7e7ab8e3f4a2253bb4bc73e496 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v24-only-part-3 2/6] Add Datum storage support to tuplestore

 Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)
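
The commit message above describes two on-tape layouts for Datums. The following is a minimal standalone sketch of that idea, not code from the patch: it writes into a plain byte buffer instead of a BufFile, and the names encode_datum() and decode_length() are hypothetical. It assumes the length word is an "unsigned int" that counts only the payload bytes, and that the trailing copy is written only for variable-length data when backward scans are enabled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration of the two layouts.  'buf' must be large enough.
 *
 * typlen > 0: fixed-length type, raw bytes only, no length words.
 * typlen < 0: variable-length type, a leading "unsigned int" length word,
 *             the payload, and a trailing length word if 'backward' is set.
 *
 * Returns the number of bytes written to 'buf'.
 */
static size_t
encode_datum(unsigned char *buf, const void *data, unsigned int datalen,
			 int16_t typlen, bool backward)
{
	size_t		off = 0;

	if (typlen > 0)
	{
		/* the reader already knows the size from the type */
		memcpy(buf, data, (size_t) typlen);
		return (size_t) typlen;
	}

	memcpy(buf + off, &datalen, sizeof(datalen));
	off += sizeof(datalen);
	memcpy(buf + off, data, datalen);
	off += datalen;

	if (backward)
	{
		/* trailing copy lets a backward scan find the record start */
		memcpy(buf + off, &datalen, sizeof(datalen));
		off += sizeof(datalen);
	}
	return off;
}

/*
 * Counterpart of the "lentup" callback idea: the payload length is either
 * a constant taken from the type, or read from the leading length word.
 */
static unsigned int
decode_length(const unsigned char *buf, int16_t typlen)
{
	unsigned int len;

	if (typlen > 0)
		return (unsigned int) typlen;
	memcpy(&len, buf, sizeof(len));
	return len;
}
```

Under these assumptions a 6-byte fixed-length value such as a TID occupies exactly 6 bytes on tape, while a 5-byte variable-length payload occupies 5 bytes plus one or two length words.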

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple from tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For fixed-length Datums, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for fixed-length Datums);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = (unsigned int) state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
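
Reviewer note (not part of the patch): the on-disk layout that the Datum routines above describe for variable-length values (length prefix, payload, optional trailing length word for backward scans) can be sketched in plain C as follows. This is only an illustration under stated assumptions: `write_varlen` and `read_varlen` are hypothetical names, stdio stands in for BufFile, and malloc stands in for palloc. It is not the patch's implementation.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the by-reference branch of a writetup routine:
 * the length is known before the prefix is written, and the same
 * unsigned int width is used for the prefix and the trailing word.
 */
static void
write_varlen(FILE *f, const void *data, unsigned int len, bool backward)
{
	fwrite(&len, sizeof(len), 1, f);	/* leading length word */
	fwrite(data, 1, len, f);			/* payload bytes */
	if (backward)
		fwrite(&len, sizeof(len), 1, f);	/* trailing length word */
}

/*
 * Hypothetical stand-in for the matching read side.  Returns a malloc'd
 * buffer holding the payload, or NULL at end of file or on a short read.
 */
static void *
read_varlen(FILE *f, unsigned int *len_out, bool backward)
{
	unsigned int len;
	unsigned int trailer;
	char	   *buf;

	/* a short read of the leading length word means end of file */
	if (fread(&len, sizeof(len), 1, f) != 1)
		return NULL;

	/* read the payload into the buffer itself, not into the pointer variable */
	buf = malloc(len > 0 ? len : 1);
	if (buf == NULL)
		return NULL;
	if (len > 0 && fread(buf, 1, len, f) != len)
	{
		free(buf);
		return NULL;
	}

	/* consume the trailing length word if the writer emitted one */
	if (backward &&
		(fread(&trailer, sizeof(trailer), 1, f) != 1 || trailer != len))
	{
		free(buf);
		return NULL;
	}

	*len_out = len;
	return buf;
}
```

The point of the sketch is the symmetry: whatever width and order the writer uses for the length words, the reader has to consume exactly the same bytes, otherwise every later record is read from the wrong offset.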



  [application/octet-stream] nocfbot-v24-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch (30.5K, 6-nocfbot-v24-only-part-3-0004-Track-and-drop-auxiliary-indexes-in-.patch)
  download | inline diff:
From 5daa012bb161e0f47745b9bbac3b4c322c0ac2b9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v24-only-part-3 4/6] Track and drop auxiliary indexes in
 DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add an auxiliary_for_index_id parameter to makeIndexInfo() (stored in IndexInfo as ii_AuxiliaryForIndexId) to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
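
Reviewer note (not part of the commit): the lookup that get_auxiliary_index() performs against pg_depend can be pictured with a small, self-contained model. This is a sketch under assumptions: `DepRow` and `find_auxiliary_index` are hypothetical names, the catalog is replaced by an in-memory array, and the relkind of the dependent object is stored in the row instead of being fetched with get_rel_relkind(). The characters 'a' and 'i' mirror DEPENDENCY_AUTO and RELKIND_INDEX.

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified model of one pg_depend row. */
typedef struct DepRow
{
	Oid			objid;		/* the dependent object */
	Oid			refobjid;	/* the object it depends on */
	char		deptype;	/* 'a' = auto, 'n' = normal */
	char		relkind;	/* relkind of objid, 'i' = index */
} DepRow;

/*
 * Return the first index that has an AUTO dependency on main_index,
 * or InvalidOid if there is none.
 */
static Oid
find_auxiliary_index(const DepRow *rows, size_t nrows, Oid main_index)
{
	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].refobjid == main_index &&
			rows[i].deptype == 'a' &&
			rows[i].relkind == 'i')
			return rows[i].objid;
	}
	return InvalidOid;
}
```

Because the auxiliary index is recorded with an AUTO dependency on the main index, dropping the main index removes the auxiliary one through the normal dependency machinery; the explicit lookup is only needed where the auxiliary index has to be dropped while the main index survives, as in REINDEX.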
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d4ac7a0e606..9978192f2d8 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -775,6 +775,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId must be valid if and only if INDEX_CREATE_AUXILIARY is set */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1180,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1412,7 +1423,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1580,7 +1592,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2616,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2677,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3847,6 +3862,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3903,6 +3919,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check whether this index has an auxiliary index; if so, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4191,7 +4220,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4280,13 +4310,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4312,18 +4359,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index held by a
+		 * relation of relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0c7740e2c85..5fb447195b5 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3683,6 +3684,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4032,6 +4034,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4039,6 +4042,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4112,12 +4116,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4127,6 +4136,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4148,10 +4158,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4340,7 +4358,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4363,6 +4382,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready as well */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4581,6 +4603,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4632,6 +4656,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fc89352b661..0cc88d3064f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1533,6 +1533,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1593,9 +1595,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1647,6 +1660,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must be the first action in its transaction.
+		 * If there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently).  In that case, perform a silent internal deletion of
+		 * the auxiliary index and then restore the transaction state.  Doing
+		 * this in the loop is fine: drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1675,12 +1716,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7263c5e29a9..de8f962a792 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index a3e85ba1310..85cd088d080 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] nocfbot-v24-only-part-3-0006-Refresh-snapshot-periodically-during.patch (21.1K, 7-nocfbot-v24-only-part-3-0006-Refresh-snapshot-periodically-during.patch)
From 37a9ba542c320759fa60bc1360a58ed439157dec Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v24-only-part-3 6/6] Refresh snapshot periodically during
 index validation

Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot periodically (every VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE = 4096 pages read), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 43 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 25 ++++----
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 140 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 28e2a1604c4..604bdda59ff 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c85e5332ba2..12baa8728d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1996,23 +1996,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2023,14 +2026,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2046,6 +2051,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so
+	 * often to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2079,6 +2107,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2134,6 +2163,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2143,9 +2186,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 8f8a1ad7796..d57485cefc2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is
+			 * a validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 976e4d6e980..401be545ba2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -68,6 +68,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3513,8 +3514,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3527,7 +3529,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3548,13 +3550,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3604,8 +3607,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3641,6 +3648,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3675,6 +3685,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3694,19 +3707,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3729,6 +3747,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5fb447195b5..d8bce846c23 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1805,32 +1804,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1852,8 +1830,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4401,7 +4379,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4416,13 +4393,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4434,16 +4404,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4456,7 +4418,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 22446b32157..5fa60e8e37b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -702,12 +702,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1808,20 +1807,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0
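
The comments in the patch above say that, once validation finishes, the builder must wait out every transaction that might hold a snapshot older than the last one it used, and that it must drop its own snapshot first so that other concurrent index builds do not wait on it. A minimal sketch of that selection rule follows. It assumes a flat array of advertised xmins in place of the real proc array, and `ToyXid`, `toy_xid_precedes_or_equals` and `toy_backends_to_wait_for` are hypothetical names rather than PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ToyXid;              /* hypothetical stand-in for TransactionId */
#define TOY_INVALID_XID ((ToyXid) 0)

/* True if a is older than or equal to b, compared modulo 2^32. */
static int
toy_xid_precedes_or_equals(ToyXid a, ToyXid b)
{
	return (int32_t) (a - b) <= 0;
}

/*
 * Store in out[] the indexes of backends whose advertised xmin is valid and
 * not newer than limit_xmin; those may still need tuples that the new index
 * lacks.  Backends with an invalid xmin hold no snapshot and are skipped,
 * which is why the builder clears its own xmin before waiting.
 * out[] must have room for nbackends entries.  Returns the number stored.
 */
static size_t
toy_backends_to_wait_for(const ToyXid *backend_xmin, size_t nbackends,
						 ToyXid limit_xmin, size_t *out)
{
	size_t		n = 0;

	assert(backend_xmin != NULL && out != NULL);

	for (size_t i = 0; i < nbackends; i++)
	{
		if (backend_xmin[i] == TOY_INVALID_XID)
			continue;			/* no snapshot held: nothing to wait for */
		if (toy_xid_precedes_or_equals(backend_xmin[i], limit_xmin))
			out[n++] = i;
	}
	return n;
}
```

Because the limit returned by validation only moves forward as snapshots are refreshed, the final wait covers every transaction that could predate the latest snapshot, while the builder itself stops holding back the xmin horizon during the scan.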



  [application/octet-stream] v24-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 8-v24-0002-Add-stress-tests-for-concurrent-index-builds.patch)
From 70b8e6147cebcc427b4df419cac7cc7f9056973b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v24 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v24-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.2K, 9-v24-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From 19b8842d9bdd8da3d5a88ea466fb64658750ae18 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v24 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch introduces an alternative: the snapshot used during the first phase is reset every N pages of the heap scan, which allows the xmin horizon to advance.

Currently, this technique is limited to:

- the first scan of the heap only: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
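
As a rough illustration of the rule described above, here is a minimal, self-contained sketch in C. It is a toy model, not the patch's code: ToyScan, toy_should_reset_snapshots(), toy_visit_block() and toy_scan() are hypothetical names, the reset interval is passed in as a parameter rather than taken from SO_RESET_SNAPSHOT_EACH_N_PAGE, and a counter stands in for the real pop/push of the active snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of a heap scan; none of these names exist in PostgreSQL. */
typedef struct ToyScan
{
	bool		reset_snapshot;			/* models the SO_RESET_SNAPSHOT flag */
	uint32_t	reset_each_n_pages;		/* models SO_RESET_SNAPSHOT_EACH_N_PAGE */
	uint32_t	snapshot_generation;	/* bumped on every simulated reset */
} ToyScan;

/*
 * Eligibility as described in the commit message: only concurrent,
 * non-unique, non-parallel builds on non-catalog tables reset snapshots.
 */
bool
toy_should_reset_snapshots(bool concurrent, bool unique,
						   bool parallel, bool system_catalog)
{
	return concurrent && !unique && !parallel && !system_catalog;
}

/*
 * Models the per-block check: when the flag is set and the block number is
 * a multiple of the interval, the snapshot is replaced with a fresh one.
 */
void
toy_visit_block(ToyScan *scan, uint32_t blkno)
{
	if (scan->reset_snapshot && blkno % scan->reset_each_n_pages == 0)
		scan->snapshot_generation++;
}

/* Visit blocks [0, nblocks) and return how many resets happened. */
uint32_t
toy_scan(ToyScan *scan, uint32_t nblocks)
{
	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
		toy_visit_block(scan, blkno);
	return scan->snapshot_generation;
}
```

With 10 blocks and an interval of 4, this sketch replaces the snapshot at blocks 0, 4 and 8; with the flag cleared it never does, which corresponds to the unique and parallel cases that keep a single snapshot in this commit.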
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 427 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2445f001700..25a32a13565 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index b5de68b7232..331b4f2b916 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7ff7467e462..186edd0d229 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e9d4b27427e..2f947d36619 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9b2ec9815f1..bfc27474433 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..d73968475c0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -633,6 +634,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -674,7 +705,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1336,6 +1372,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..e32ee739733 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires some additional infrastructure to perform
+	 * the scan with SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, since it may be a copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..7b09ad878b7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -322,18 +322,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -481,6 +485,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -536,7 +543,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -558,18 +565,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..edc07b72018 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1544,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3253,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3317,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to propagate the xmin horizon forward.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b10429c3721..a7994652ead 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,23 +1693,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible under a single
+	 * snapshot or several refreshed ones. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4106,9 +4100,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4123,7 +4114,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 41bd8353430..2a25bb0654a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -63,6 +63,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6927,6 +6928,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6982,6 +6984,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7039,6 +7046,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..8b3ec6430ad 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 typedef struct GlobalVisState GlobalVisState;
 typedef struct TupleTableSlot TupleTableSlot;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..71af14d1c31 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -25,6 +25,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -63,6 +64,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset the scan and catalog snapshots every so often?  If set, after every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * The snapshot is not popped at the end of the scan.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -899,7 +911,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -907,6 +920,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1739,6 +1761,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-unique index in a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan.  The snapshot is then replaced
+ * on the fly, which allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..f4a62ed1ca7 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum
+REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..ba7bc0cc384 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -38,6 +38,7 @@ tests += {
       'hashagg',
       'reindex_conc',
       'vacuum',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.project_build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v24-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 10-v24-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From bad826bea3424e91f38b05262157b0ae5743723d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v24 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single
 commit. It is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca2bde62e82..b10429c3721 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..0edf54e852d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1f2da072632..f77fe42a2a9 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4c5647ac38a..f6d2a6ede93 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da5d901ec3c..d0c4386f798 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -803,12 +803,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -843,8 +845,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -856,30 +858,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -902,7 +950,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -922,27 +976,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -962,7 +1012,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -970,6 +1020,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1007,27 +1061,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference is used, its predicate
+		 * must be implied by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1035,7 +1097,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* It is required to have at least one indisvalid index during the planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0



  [application/octet-stream] v24-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.7K, 11-v24-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From da1b470449a5bb416b11d688ee0bc11133df4d25 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v24 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring tuples that are dead according to SnapshotSelf during the uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values; this is what guarantees correctness

Tuples are tested with SnapshotSelf only when their index key values are equal; otherwise _bt_load works as before.
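
For illustration only, the rule enforced by the new _bt_load check can be sketched as a standalone C function: within each run of equal keys in sorted order, at most one tuple may be alive. SketchTuple and has_alive_duplicate are hypothetical names that do not appear in the patch, and the precomputed "alive" flag stands in for the SnapshotSelf test, which the patch performs lazily and only for tuples whose keys are equal to the previous one.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sorted index tuple: a key plus a liveness flag. */
typedef struct SketchTuple
{
	int		key;
	bool	alive;		/* stand-in for the result of a SnapshotSelf test */
} SketchTuple;

/*
 * Return true if any run of equal keys in the sorted array contains more
 * than one alive tuple.  Dead tuples between alive ones do not hide the
 * duplicate (the "addda" case).
 */
static bool
has_alive_duplicate(const SketchTuple *tuples, size_t ntuples)
{
	bool	seen_alive = false;

	for (size_t i = 0; i < ntuples; i++)
	{
		/* A different key starts a new group, so reset the group state. */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return true;	/* second alive tuple with the same key */
			seen_alive = true;
		}
	}
	return false;
}
```

A group such as alive, dead, dead, dead, alive is reported as a violation, while a group with a single alive tuple surrounded by dead ones is accepted.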
---
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  71 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 266 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a7e16871af6..42748c01a49 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index ab0b6946cb0..9a9ee55ff1b 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -149,7 +149,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -375,7 +375,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -790,12 +790,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 53b7ddfff0e..ee94ab509e7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -84,6 +84,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -102,6 +103,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -204,15 +206,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid the uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -304,8 +304,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -322,20 +320,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -382,6 +380,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in the case of a concurrent build.
+	 * This is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -430,8 +433,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -439,8 +443,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -471,7 +479,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -484,7 +492,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -540,7 +548,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -562,7 +570,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -576,7 +584,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1155,13 +1163,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool contains no
+	 * multiple alive tuples with the same values. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* if is not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet checked */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index b88c396195a..ed5425ac6ec 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -688,7 +688,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -719,7 +719,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -968,7 +968,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -989,7 +989,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1028,7 +1028,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1150,7 +1150,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 41b4fbd1c37..3fff5f45a9d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -68,8 +68,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2423,7 +2421,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3859,7 +3857,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3977,17 +3975,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported so that it can serve as the comparison function during a
+ * concurrent unique index build when _bt_keep_natts_fast is not suitable
+ * because the collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is not NULL, it is set to true when any compared value is null.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4013,6 +4018,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4032,7 +4039,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4043,7 +4050,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as the comparison function during a
+ * concurrent build of a unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4052,6 +4060,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is not NULL, it is set to true when any compared value is null.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4060,7 +4070,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4077,6 +4088,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 371863895dd..c017226fa31 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3323,9 +3323,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often.  The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a7994652ead..dda1eb0e94c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,8 +1693,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible to a series of
+	 * periodically refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 890cdbe1204..1ce2e2ad63c 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -24,6 +24,8 @@
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "executor/executor.h"
@@ -33,6 +35,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -134,6 +137,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -359,6 +363,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -401,6 +406,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1654,6 +1660,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1663,18 +1670,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+					 errmsg("could not create unique index \"%s\"",
+							RelationGetRelationName(arg->index.indexRel)),
+					 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+					 errdetail("Duplicate keys exist."),
+					 errtableconstraint(arg->index.heapRel,
+										RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ab467cb8fd..0c9f0e1f3a6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1340,8 +1340,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 613615c78cd..8f5aa0d7146 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1763,9 +1763,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan.
+ * This changes snapshots on the fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
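
The hasnulls out-parameter added to _bt_keep_natts() and _bt_keep_natts_fast() in the patch above can be pictured with a small standalone model. This is only a sketch under stated assumptions: SketchTuple, sketch_keep_natts() and sketch_is_unique_conflict() are hypothetical names that do not exist in PostgreSQL, keys are plain ints rather than Datums, and the way a caller combines the result with nulls_not_distinct is a guess at the intended use, not code taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple: int keys plus per-key null flags. */
#define SKETCH_MAX_KEYS 4

typedef struct SketchTuple
{
	int			values[SKETCH_MAX_KEYS];
	bool		isnull[SKETCH_MAX_KEYS];
} SketchTuple;

/*
 * Returns the 1-based number of leading attributes needed to tell the two
 * tuples apart.  A result of nkeyatts + 1 means every key attribute compared
 * equal.  If hasnulls is not NULL, it is OR-ed with "a null was seen in a
 * compared attribute", mirroring the pattern added by the patch.
 */
static int
sketch_keep_natts(const SketchTuple *lastleft, const SketchTuple *firstright,
				  int nkeyatts, bool *hasnulls)
{
	int			keepnatts = 1;

	for (int attnum = 1; attnum <= nkeyatts; attnum++)
	{
		bool		isnull1 = lastleft->isnull[attnum - 1];
		bool		isnull2 = firstright->isnull[attnum - 1];

		/* Optional out-parameter: callers that do not care pass NULL. */
		if (hasnulls)
			*hasnulls |= (isnull1 || isnull2);

		if (isnull1 != isnull2)
			break;
		if (!isnull1 &&
			lastleft->values[attnum - 1] != firstright->values[attnum - 1])
			break;

		keepnatts++;
	}

	return keepnatts;
}

/*
 * Assumed caller logic: two tuples conflict in a unique index when all key
 * attributes are equal and either no null was seen or nulls are declared
 * not distinct.
 */
static bool
sketch_is_unique_conflict(const SketchTuple *a, const SketchTuple *b,
						  int nkeyatts, bool nulls_not_distinct)
{
	bool		hasnulls = false;
	int			keepnatts = sketch_keep_natts(a, b, nkeyatts, &hasnulls);

	return keepnatts > nkeyatts && (nulls_not_distinct || !hasnulls);
}
```

Passing NULL for hasnulls keeps the existing call sites (split-point selection, suffix truncation) unchanged in behavior, which is why the patch only has to append NULL to them.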



  [application/octet-stream] v24-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 12-v24-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From cf40167f621111c61489d9640769ccbc8885d0f2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v24 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying this technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with the scan
- relax the restriction that prevented parallel workers from calling GetLatestSnapshot
---
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)
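
The ordering change this patch makes in _brin_begin_parallel(), _gin_begin_parallel() and _bt_begin_parallel() can be summarized with a small standalone model. It is a sketch only: SketchStep and sketch_leader_steps() are hypothetical names, and reading the second argument of WaitForParallelWorkersToAttach() as "also wait for snapshot restoration" is an assumption based on the commit message, since the function body is not part of this excerpt.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical step labels; none of these names exist in PostgreSQL. */
typedef enum SketchStep
{
	SKETCH_WAIT_ATTACH_AND_SNAPSHOT,	/* models WaitForParallelWorkersToAttach(pcxt, true) */
	SKETCH_LEADER_SCAN,					/* models the leader joining the heap scan */
	SKETCH_WAIT_ATTACH_ONLY				/* models WaitForParallelWorkersToAttach(pcxt, false) */
} SketchStep;

/*
 * Records, in order, what the leader does after launching workers.
 * "steps" must have room for at least two entries.  Returns the number of
 * entries written.
 */
static size_t
sketch_leader_steps(bool isconcurrent, bool leaderparticipates,
					SketchStep *steps)
{
	size_t		n = 0;

	/*
	 * Concurrent build: the leader's own scan will reset its snapshot, so
	 * workers must have imported the initial snapshot before it starts.
	 */
	if (isconcurrent)
		steps[n++] = SKETCH_WAIT_ATTACH_AND_SNAPSHOT;

	if (leaderparticipates)
		steps[n++] = SKETCH_LEADER_SCAN;

	/*
	 * Non-concurrent build: keep the old behavior and only make sure the
	 * failure-to-start case cannot hang, after the leader has done its share.
	 */
	if (!isconcurrent)
		steps[n++] = SKETCH_WAIT_ATTACH_ONLY;

	return n;
}
```

The point of the model is that exactly one of the two waits runs on any given path, and that in the concurrent case it moves ahead of the leader's participation in the scan.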

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 186edd0d229..5554cfa6f4d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * We then index whatever's live according to the active snapshot, which is
+	 * reset periodically during the scan.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2f947d36619..bf26106aa5e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * We then index whatever's live according to the active snapshot, which is
+	 * reset periodically during the scan.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e32ee739733..a7e16871af6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that, while that snapshot is reset
+	 * every so often (for a non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7b09ad878b7..53b7ddfff0e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -322,22 +322,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -486,8 +484,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so wait until
+	 * all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..8b33b6278ce 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan with
+	 * SO_RESET_SNAPSHOT and use the active snapshot as the initial
+	 * state.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate shared memory for the per-worker flags that report whether
+		 * each worker has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * wait_for_snapshot: if true, a worker is counted as attached only after it
+ * has also restored its snapshot. This is needed with periodic snapshot resets,
+ * so that every worker holds a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index edc07b72018..371863895dd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1532,7 +1532,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..f16284d4d0d 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -371,7 +371,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8e1a918f130..68ea98405bb 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -353,14 +353,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 71af14d1c31..613615c78cd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1140,7 +1140,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1762,9 +1763,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a concurrent build of a non-unique index, SO_RESET_SNAPSHOT is applied
+ * to the scan. The scan then changes snapshots on the fly, which allows the
+ * xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
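
For readers following the attach/snapshot handshake in the patch above, here is a minimal standalone sketch of the condition the leader applies in WaitForParallelWorkersToAttach() once wait_for_snapshot is introduced. It is not PostgreSQL code: WorkerSlot and count_known_attached() are hypothetical names invented for illustration, and the real function polls background worker handles and shared message queues rather than a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state the leader inspects. */
typedef struct WorkerSlot
{
	bool		queue_attached;		/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker set its flag in shared memory */
} WorkerSlot;

/*
 * Count the workers the leader may treat as "known attached".
 *
 * With wait_for_snapshot = false, attaching to the error queue is enough.
 * With wait_for_snapshot = true, the worker must also have reported that
 * it restored its snapshot, mirroring the check added by the patch.
 */
static size_t
count_known_attached(const WorkerSlot *slots, size_t nworkers,
					 bool wait_for_snapshot)
{
	size_t		nattached = 0;

	assert(slots != NULL || nworkers == 0);

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!slots[i].queue_attached)
			continue;			/* not attached yet, keep waiting */
		if (!wait_for_snapshot || slots[i].snapshot_restored)
			nattached++;
	}
	return nattached;
}
```

In this model the leader would keep polling until the count equals the number of launched workers. With wait_for_snapshot set, a worker that has attached but not yet restored its snapshot is simply not counted, so the leader waits for it before resetting its own snapshot.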



  [application/octet-stream] v24-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 13-v24-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 77817840f734acfeae859e539cc7abd5c6d9cb0b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v24 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: introduced solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper for concurrent index builds: it temporarily stores TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It will be used in subsequent commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 331b4f2b916..d3451078176 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 981d9380a92..d0276bf483b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3087,6 +3087,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3138,6 +3139,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c017226fa31..1ad59effea2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3433,6 +3433,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 12b4f3fd36e..b747c6e7804 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ac62f6a6abd..0d0a0f8d73f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,6 +75,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 01eba3b5a19..0d29115f200 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..1cd036a0594 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index a79325e8a2f..8e7c9de12bb 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5139,7 +5139,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5153,7 +5154,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5178,9 +5180,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5189,12 +5191,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5203,7 +5206,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v24-0007-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 14-v24-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From c2ce5e77bd650d555229f0a98dce7dfbe7d8b848 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v24 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)
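
The on-tape layout described in the commit message can be sketched in
isolation. This is only an illustration under stated assumptions, not code
from the patch: Tape, tape_write, tape_read, put_value and get_value are
hypothetical names, and a small in-memory buffer stands in for the BufFile.
Fixed-length values are written as raw bytes; variable-length values get an
"unsigned int" length word in front, plus a trailing copy when backward
scans are enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; the patch itself writes to a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

/* Returns false if fewer than n unread bytes remain. */
static bool
tape_read(Tape *t, void *dst, size_t n)
{
	if (t->rpos + n > t->wpos)
		return false;
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
	return true;
}

/*
 * typlen > 0: fixed-length value, raw bytes only ('len' is ignored).
 * typlen < 0: variable-length value, length word + bytes, plus a trailing
 *             length word when 'backward' is set.
 */
static void
put_value(Tape *t, int typlen, bool backward,
		  const void *data, unsigned int len)
{
	if (typlen > 0)
	{
		tape_write(t, data, (size_t) typlen);
		return;
	}
	tape_write(t, &len, sizeof(len));
	tape_write(t, data, len);
	if (backward)
		tape_write(t, &len, sizeof(len));
}

/* Reads the next value into dst; returns its length, or 0 at end of tape. */
static unsigned int
get_value(Tape *t, int typlen, bool backward, void *dst, size_t dstcap)
{
	unsigned int len;

	if (typlen > 0)
		len = (unsigned int) typlen;	/* constant length, nothing to read */
	else if (!tape_read(t, &len, sizeof(len)))
		return 0;

	assert(len <= dstcap);
	if (!tape_read(t, dst, len))
		return 0;

	if (typlen < 0 && backward)
	{
		unsigned int trailer;

		tape_read(t, &trailer, sizeof(trailer));	/* skip back length word */
	}
	return len;
}
```

With a 6-byte fixed-length type (the size of a TID), two values occupy
exactly 12 bytes on the tape, which is why the length words can be dropped
in that case.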

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For Datums of constant length, both length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if present; it is omitted for constant-length Datums);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0



  [application/octet-stream] v24-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.4K, 15-v24-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 5aac966264b400fd9a89c8901574a584a15edf13 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v24 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index into the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 334 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  20 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1116 insertions(+), 339 deletions(-)
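
The two-pointer step from the commit message can be sketched on plain sorted
arrays. This is a minimal illustration under stated assumptions, not the
patch's code: Tid, tid_cmp and tids_missing_from_main are hypothetical
names, and the patch itself reads from two tuplesorts and writes the result
into a tuplestore rather than working on arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: (block number, offset number). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Orders TIDs by block, then by offset. */
static int
tid_cmp(const Tid *a, const Tid *b)
{
	if (a->block != b->block)
		return a->block < b->block ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Both inputs must be sorted by tid_cmp.  Copies into 'out' every TID that
 * appears in 'aux' but not in 'mainidx' (each such TID once), and returns
 * the number of TIDs written.  'out' must have room for naux entries.
 */
static size_t
tids_missing_from_main(const Tid *aux, size_t naux,
					   const Tid *mainidx, size_t nmain,
					   Tid *out)
{
	size_t		i;
	size_t		j = 0;
	size_t		nout = 0;

	for (i = 0; i < naux; i++)
	{
		/* advance the main pointer past TIDs smaller than aux[i] */
		while (j < nmain && tid_cmp(&mainidx[j], &aux[i]) < 0)
			j++;

		/* aux[i] is already covered by the main index */
		if (j < nmain && tid_cmp(&mainidx[j], &aux[i]) == 0)
			continue;

		/* skip duplicates within the auxiliary list */
		if (nout > 0 && tid_cmp(&out[nout - 1], &aux[i]) == 0)
			continue;

		out[nout++] = aux[i];
	}
	return nout;
}
```

Because both inputs are sorted, each list is walked once, so the cost is
linear in the number of TIDs. The output is also sorted, which means heap
pages are then visited in ascending block order.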

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..5c48e529e4a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6327,6 +6327,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions
+       insert new tuples into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6367,13 +6379,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6390,8 +6401,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts the missing ones that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
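
The validation step described in the README.HOT hunk above amounts to a set difference over two sorted TID streams: everything recorded by the auxiliary index that the target index does not yet contain. A minimal standalone sketch of that merge follows. It uses plain sorted arrays instead of tuplesorts, and `tid_encode` and `sorted_difference` are hypothetical names chosen for illustration, not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for TID encoding: pack (block, offset) into one
 * integer so that integer order matches (block, offset) order.
 */
static uint64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 16) | offset;
}

/*
 * Copy every element of aux_tids[] that is absent from main_tids[] into
 * out[].  Both inputs must be sorted ascending; out[] must have room for
 * naux entries.  Returns the number of entries written.
 */
static size_t
sorted_difference(const uint64_t *main_tids, size_t nmain,
				  const uint64_t *aux_tids, size_t naux,
				  uint64_t *out)
{
	size_t		m = 0;
	size_t		n = 0;

	for (size_t a = 0; a < naux; a++)
	{
		/* The merge is only correct on sorted input. */
		assert(a == 0 || aux_tids[a - 1] <= aux_tids[a]);

		/* Advance the main cursor until it reaches or passes the aux TID. */
		while (m < nmain && main_tids[m] < aux_tids[a])
			m++;

		/* Main exhausted or overshot: the aux TID is missing from main. */
		if (m == nmain || main_tids[m] > aux_tids[a])
			out[n++] = aux_tids[a];
	}

	return n;
}
```

Because both cursors only move forward, the cost is linear in the combined length of the two inputs, which is why both sides are sorted first.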
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 42748c01a49..3cac122f7a7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
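
The read-stream callback in the hunk above hands the validation scan one heap block at a time, together with the offsets to check on that block, ending the list with an invalid offset. A minimal standalone sketch of that grouping follows. It reads from a sorted array that can be peeked, so it needs no "leftover" offset between calls, unlike the patch, which reads a forward-only tuplestore. `TidCursor`, `next_block`, `INVALID_BLOCK` and `INVALID_OFFSET` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK	UINT32_MAX	/* hypothetical "no more blocks" marker */
#define INVALID_OFFSET	0			/* hypothetical list terminator */

/* Cursor over a sorted array of TIDs encoded as (block << 16) | offset. */
typedef struct TidCursor
{
	const uint64_t *tids;
	size_t		ntids;
	size_t		pos;
} TidCursor;

/*
 * Return the next block number and fill offsets[] with every offset that
 * belongs to it, terminated by INVALID_OFFSET.  The caller must size
 * offsets[] for the largest group plus the terminator.  Returns
 * INVALID_BLOCK once the input is exhausted.
 */
static uint32_t
next_block(TidCursor *cur, uint16_t *offsets)
{
	size_t		i = 0;
	uint32_t	block;

	assert(cur != NULL && offsets != NULL);

	if (cur->pos >= cur->ntids)
	{
		offsets[0] = INVALID_OFFSET;
		return INVALID_BLOCK;
	}

	block = (uint32_t) (cur->tids[cur->pos] >> 16);

	/* Consume TIDs while they stay on the same block. */
	while (cur->pos < cur->ntids &&
		   (uint32_t) (cur->tids[cur->pos] >> 16) == block)
		offsets[i++] = (uint16_t) (cur->tids[cur->pos++] & 0xFFFF);

	offsets[i] = INVALID_OFFSET;
	return block;
}
```

Grouping by block means each heap page is read and locked once, however many candidate TIDs fall on it.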
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1ad59effea2..a36402eb649 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2469,7 +2632,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2529,7 +2693,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3306,12 +3471,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan; the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3321,18 +3495,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot.  Any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3340,12 +3517,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3363,22 +3542,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Finally, the steps to drop the auxiliary index concurrently are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3411,6 +3594,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3435,15 +3619,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		/*
+		 * The auxiliary index is empty, so there is nothing to merge.
+		 * validate_index() returns void; the caller takes limitXmin from
+		 * the reference snapshot it holds, so none is computed here.
+		 */
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3466,27 +3690,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3495,6 +3722,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3555,6 +3783,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3826,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4068,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4093,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c77fa0234bb..88d94e7ced9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1291,16 +1291,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dda1eb0e94c..8c721e20992 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1721,9 +1771,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1741,24 +1810,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1785,7 +1844,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1810,6 +1869,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Wait for all transactions to see the auxiliary index as not ready for inserts */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop the auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3564,6 +3670,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3669,8 +3776,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3722,8 +3836,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3784,6 +3905,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3887,15 +4015,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3946,6 +4077,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3959,12 +4095,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3973,6 +4114,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3991,10 +4133,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4075,13 +4221,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, since it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4124,6 +4313,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4131,12 +4355,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4174,7 +4392,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4203,7 +4421,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4293,14 +4511,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4325,6 +4543,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4338,11 +4578,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4362,6 +4602,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8f5aa0d7146..5bc16f07a86 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -718,7 +718,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1829,19 +1830,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	table_rel->rd_tableam->index_validate_scan(table_rel,
+											   index_rel,
+											   index_info,
+											   snapshot,
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 98e68e972be..a3e85ba1310 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..ae3bfc3688e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2050,14 +2050,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
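
For readers following the renumbering, the phase numbers from progress.h and the labels from the pg_stat_progress_create_index view (rules.out) in the patch above can be tabulated in a small standalone C sketch. This is only an illustration, not code from the patch: the table `phase_names` and the helpers `createidx_phase_name()` and `createidx_phase_is_wait()` are hypothetical names, and the AM-specific suffix that the view appends to "building index" is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Phase numbers as defined in progress.h after the patch. */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/* Hypothetical lookup table; labels copied from the updated view. */
static const char *const phase_names[] = {
	[0] = "initializing",
	[PROGRESS_CREATEIDX_PHASE_WAIT_1] = "waiting for writers before build",
	[PROGRESS_CREATEIDX_PHASE_WAIT_2] = "waiting for writers to use auxiliary index",
	[PROGRESS_CREATEIDX_PHASE_BUILD] = "building index",
	[PROGRESS_CREATEIDX_PHASE_WAIT_3] = "waiting for writers before validation",
	[PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN] = "index validation: scanning index",
	[PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT] = "index validation: sorting tuples",
	[PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE] = "index validation: merging indexes",
	[PROGRESS_CREATEIDX_PHASE_WAIT_4] = "waiting for old snapshots",
	[PROGRESS_CREATEIDX_PHASE_WAIT_5] = "waiting for readers before marking dead",
	[PROGRESS_CREATEIDX_PHASE_WAIT_6] = "waiting for readers before dropping",
};

/* Return the label for a phase number, or NULL if it is out of range. */
static const char *
createidx_phase_name(int phase)
{
	const int	nphases = (int) (sizeof(phase_names) / sizeof(phase_names[0]));

	/* Sanity check: the table must cover phases 0..WAIT_6. */
	assert(nphases == PROGRESS_CREATEIDX_PHASE_WAIT_6 + 1);

	if (phase < 0 || phase >= nphases)
		return NULL;
	return phase_names[phase];
}

/* True if the phase is one of the six "waiting for ..." phases. */
static bool
createidx_phase_is_wait(int phase)
{
	const char *name = createidx_phase_name(phase);

	return name != NULL && strncmp(name, "waiting", strlen("waiting")) == 0;
}
```

Note that the old phase 6 ("index validation: scanning table") has no counterpart in the new numbering; its slot is taken by the new merge phase at 7, and every later wait phase shifts up by one.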



  [application/octet-stream] v24-0011-Refresh-snapshot-periodically-during-index-valid.patch (23.5K, 16-v24-0011-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 2438180f70600dc82ba715245c0dc8cdeac465f3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v24 11/12] Refresh snapshot periodically during index
 validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 pages read (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
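
Note for reviewers: the refresh cadence described in the commit message can be sketched as a standalone C simulation. This is a minimal sketch under stated assumptions, not the patch's code: `simulated_snapshot_xmin()` is a made-up stand-in for taking a new snapshot, `xid_newer()` is a hypothetical stand-in for TransactionIdNewer() (whose definition is not shown in this excerpt) and handles only normal XIDs, and `scan_with_refresh()` is an invented name. Only the counter logic (start at 1, increment per page, refresh when the counter is a multiple of N) follows the patch.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Hypothetical stand-in for TransactionIdNewer(): return whichever XID is
 * later, using a wraparound-aware comparison.  Valid for normal XIDs only.
 */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	return ((int32_t) (a - b) > 0) ? a : b;
}

/*
 * Made-up snapshot source for the simulation: pretend the xmin horizon
 * advances by one XID per page read.
 */
static TransactionId
simulated_snapshot_xmin(int pages_read)
{
	return (TransactionId) (1000 + pages_read);
}

/*
 * Walk npages pages, replacing the snapshot whenever the page counter is a
 * multiple of refresh_every.  Returns the newest xmin seen (limitXmin) and
 * stores the number of refreshes in *refreshes.
 */
static TransactionId
scan_with_refresh(int npages, int refresh_every, int *refreshes)
{
	int			page_read_counter = 1;	/* 1, so no reset happens at start */
	TransactionId limit_xmin = simulated_snapshot_xmin(0);

	assert(refresh_every > 0);
	*refreshes = 0;

	for (int page = 1; page <= npages; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked against the snapshot ... */

		if (page_read_counter % refresh_every == 0)
		{
			/* Drop the old snapshot, take a new one, keep the newer xmin. */
			TransactionId fresh_xmin = simulated_snapshot_xmin(page);

			limit_xmin = xid_newer(limit_xmin, fresh_xmin);
			(*refreshes)++;
		}
	}

	return limit_xmin;
}
```

With 10 pages and a refresh interval of 4, the simulation refreshes after pages 3 and 7; because the counter starts at 1, the first refresh comes one page earlier than the interval would suggest.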
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 15 ++---
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 11 files changed, 150 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index cf14f474946..1626cee7a03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Because concurrent index builds use periodically refreshed snapshots and
+   auxiliary indexes, they have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
+index, and inserts any missing ones if they are visible to a fresh snapshot.
+index, and inserts any missing ones if them visible to fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3cac122f7a7..409852f23e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked and returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ee94ab509e7..4f936a6cd98 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -445,7 +445,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 8f8a1ad7796..d57485cefc2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is
+			 * a validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index bb470601c2f..ba7115e4dc2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3535,8 +3535,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3549,7 +3550,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3570,13 +3571,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3626,8 +3628,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3663,6 +3669,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3697,6 +3706,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3716,19 +3728,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3751,6 +3768,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b3d4b19288..95d9ba57324 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1793,32 +1792,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1840,8 +1818,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last validation snapshot was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4381,7 +4359,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4396,13 +4373,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4414,16 +4384,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4436,7 +4398,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5bc16f07a86..66d5dfb96d6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,12 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1834,18 +1833,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0

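For readers following the limitXmin bookkeeping in the patch above, here is a minimal standalone sketch of the idea, not the patch's code. It assumes transaction IDs are plain 32-bit counters with 0 meaning "invalid" and values below 3 reserved, and it models "take a fresh snapshot" as reading the next entry of an array of xmin values. All `sketch_*` and `Sketch*` names are hypothetical and exist only for this illustration; only the 4096-page interval is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;		/* stand-in for TransactionId (assumption) */

#define SKETCH_INVALID_XID		((SketchXid) 0)
#define SKETCH_FIRST_NORMAL_XID	((SketchXid) 3)
#define SKETCH_RESET_EACH_N_PAGE 4096	/* same interval as in the patch */

/* Wraparound-aware test: is "a" logically newer than "b"? */
static bool
sketch_xid_follows(SketchXid a, SketchXid b)
{
	/* Special (non-normal) IDs are compared as plain unsigned values. */
	if (a < SKETCH_FIRST_NORMAL_XID || b < SKETCH_FIRST_NORMAL_XID)
		return a > b;
	/* Normal IDs are compared modulo 2^32. */
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two IDs, ignoring an invalid one. */
static SketchXid
sketch_xid_newer(SketchXid a, SketchXid b)
{
	if (a == SKETCH_INVALID_XID)
		return b;
	if (b == SKETCH_INVALID_XID)
		return a;
	return sketch_xid_follows(a, b) ? a : b;
}

/*
 * Model of the validation loop: xmins[0] is the xmin of the first snapshot,
 * xmins[k] the xmin of the k-th replacement snapshot (the last entry is
 * reused if the array runs out).  Returns the limit the caller would have
 * to wait out before marking the index valid.
 */
static SketchXid
sketch_validate_scan(uint32_t npages, const SketchXid *xmins, size_t nxmins)
{
	SketchXid	limit;
	size_t		resets = 0;

	assert(nxmins > 0);
	limit = xmins[0];

	for (uint32_t page = 1; page <= npages; page++)
	{
		/* ... tuples on this page would be checked here ... */

		if (page % SKETCH_RESET_EACH_N_PAGE == 0)
		{
			size_t		idx;

			/* drop the old snapshot, take a new one, keep the newest xmin */
			resets++;
			idx = resets < nxmins ? resets : nxmins - 1;
			limit = sketch_xid_newer(limit, xmins[idx]);
		}
	}
	return limit;
}
```

With xmin values 100, 105, 103, 120 and a scan of 3 * 4096 pages, the sketch goes through three replacement snapshots and never lets the limit move backwards, even though the third snapshot reports an older xmin than the second.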


  [application/octet-stream] v24-0010-Optimize-auxiliary-index-handling.patch (2.4K, 17-v24-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From a33bd2ef19019bc4a0fb59d2c0a6b52bcae8fdb8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v24 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indices by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indices to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 79648ea71a0..bb470601c2f 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2933,6 +2933,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0edf54e852d..09b9b811def 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0

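The two shortcuts in the patch above can be pictured with a small standalone model. This is only a sketch under stated assumptions: `SketchIndexInfo`, `SketchDatum` and both functions are hypothetical stand-ins, the "normal" path simply copies column values instead of evaluating expressions, and the result of the unchanged-keys check is passed in as a precomputed flag, whereas in the patch the `&&` short-circuit avoids calling index_unchanged_by_update() at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t SketchDatum;	/* stand-in for Datum (assumption) */

typedef struct SketchIndexInfo
{
	int			num_attrs;		/* number of index columns */
	bool		auxiliary;		/* true for an auxiliary index */
} SketchIndexInfo;

/*
 * Fill values[]/isnull[] for one index entry.  An auxiliary index only
 * needs the TID, so every column is reported as NULL without looking at
 * the row at all.
 */
static void
sketch_form_index_datum(const SketchIndexInfo *info,
						const SketchDatum *row,
						SketchDatum *values,
						bool *isnull)
{
	if (info->auxiliary)
	{
		for (int i = 0; i < info->num_attrs; i++)
		{
			values[i] = (SketchDatum) 0;
			isnull[i] = true;
		}
		return;
	}

	/* Simplified "normal" path: copy the column values as they are. */
	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = row[i];
		isnull[i] = false;
	}
}

/*
 * Decide the indexUnchanged hint.  keys_unchanged stands in for the result
 * of the (comparatively expensive) unchanged-keys check.
 */
static bool
sketch_index_unchanged_hint(bool is_update,
							const SketchIndexInfo *info,
							bool keys_unchanged)
{
	return is_update && !info->auxiliary && keys_unchanged;
}
```

In this model an auxiliary index always gets all-NULL values and a false hint, regardless of the row contents.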


  [application/octet-stream] v24-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 18-v24-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From de297b3a2e8603dd27866c698fd271fcce722bf4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v24 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is simply to run <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>ccaux</literal>, the recommended
+    recovery method is simply to run <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index a36402eb649..79648ea71a0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2633,7 +2646,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2694,7 +2708,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3869,6 +3884,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3941,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4242,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4332,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4381,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8c721e20992..7b3d4b19288 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3671,6 +3672,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4020,6 +4022,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4027,6 +4030,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4100,12 +4104,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Find a junk auxiliary index of the reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4115,6 +4124,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4136,10 +4146,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4320,7 +4338,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start dropping auxiliary indexes, including the junk
+	 * indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4343,6 +4362,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk auxiliary index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4561,6 +4583,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4612,6 +4636,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fc89352b661..0cc88d3064f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1533,6 +1533,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1593,9 +1595,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1647,6 +1660,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop is required to be the first transaction. But
+		 * if a junk auxiliary index exists, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. Doing it
+		 * in the loop is fine because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1675,12 +1716,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so use the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1cd036a0594..53e15502ec1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index a3e85ba1310..85cd088d080 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
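
For readers following the lookup that get_auxiliary_index() performs in the patch above, here is a minimal, self-contained sketch of the matching rule in plain C. It is a toy model under stated assumptions, not the patch's code: ToyDepend, ToyRelKind, ToyDepType and toy_get_auxiliary_index() are hypothetical stand-ins for the pg_depend scan, and the relation kind is stored inline instead of being fetched with get_rel_relkind().

```c
#include <stddef.h>

/* Hypothetical stand-ins for PostgreSQL catalog types (not the real ones). */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_KIND_TABLE, TOY_KIND_INDEX } ToyRelKind;

/* One simplified dependency row: "objid depends on refobjid". */
typedef struct ToyDepend
{
	Oid			objid;		/* the dependent object */
	ToyRelKind	objkind;	/* kind of the dependent object */
	Oid			refobjid;	/* the object it depends on */
	ToyDepType	deptype;	/* kind of dependency */
} ToyDepend;

/*
 * Return the first index-kind object that has an AUTO dependency on
 * indexId, or InvalidOid if there is none.
 */
Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objkind == TOY_KIND_INDEX)
			return deps[i].objid;
	}

	return InvalidOid;
}
```

A caller would treat InvalidOid as "no junk auxiliary index", which corresponds to the OidIsValid(junkAuxIndexId) guards in the patch.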



  [application/octet-stream] v24-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.2K, 19-v24-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 0f44c70f973fc202986c7da38c1fc8f7738e541c Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v24 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to skip waiting for other backends running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY on indexes without expressions or predicates. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- no longer excluding PROC_IN_SAFE_IC processes in the GetCurrentVirtualXIDs() calls made by WaitForOlderSnapshots()
- removing related test cases and injection points
---
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql
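
As a reading aid for the WaitForOlderSnapshots() hunks below, the following sketch models how an exclusion mask over status flags decides which processes a waiter has to wait for. It is a simplified illustration, not PostgreSQL code: ToyProc, the TOY_PROC_* values and toy_count_waited_for() are hypothetical, and the real GetCurrentVirtualXIDs() applies further filters (such as the xmin limit) that are left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits; the names mirror the real flags, values are assumed. */
#define TOY_PROC_IS_AUTOVACUUM	0x01
#define TOY_PROC_IN_VACUUM		0x02
#define TOY_PROC_IN_SAFE_IC		0x04

/* Hypothetical, stripped-down process entry. */
typedef struct ToyProc
{
	uint8_t		statusFlags;
} ToyProc;

/*
 * Count the processes a waiter must wait for: every process whose status
 * flags do not intersect excludeMask.
 */
size_t
toy_count_waited_for(const ToyProc *procs, size_t nprocs, uint8_t excludeMask)
{
	size_t		nwaited = 0;

	for (size_t i = 0; i < nprocs; i++)
	{
		if (procs[i].statusFlags & excludeMask)
			continue;			/* excluded, safe to ignore */
		nwaited++;
	}

	return nwaited;
}
```

Under this model, dropping TOY_PROC_IN_SAFE_IC from the mask means that a process flagged only as a "safe" concurrent index build is no longer skipped, which is the behavioural effect of the mask change in this patch.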

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5554cfa6f4d..cebcb777ef3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bf26106aa5e..829ecb4ed41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 4f936a6cd98..f4ea4cce04d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set on the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 95d9ba57324..2480c6e8cf0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -114,7 +114,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -417,10 +416,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -441,8 +437,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -462,8 +457,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -577,7 +571,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1670,10 +1659,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1728,9 +1713,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1760,10 +1742,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1789,9 +1767,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1808,9 +1784,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1851,10 +1824,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1875,10 +1844,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3653,7 +3618,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4027,17 +3991,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4103,7 +4056,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4204,11 +4156,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4240,10 +4187,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4252,11 +4195,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4281,10 +4219,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4304,11 +4238,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4330,10 +4259,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4369,10 +4294,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4400,9 +4321,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4424,13 +4342,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4486,12 +4397,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4555,12 +4460,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4828,36 +4727,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..f47d268d6c7 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f4a62ed1ca7..b217b1aa951 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
+REGRESS = injection_points hashagg vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index ba7bc0cc384..7feaf05129c 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'vacuum',
       'cic_reset_snapshots',
     ],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-10-28 18:37                                     ` Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-10-28 18:37 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Fixed an assertion that caused
https://cirrus-ci.com/task/4890065790304256?logs=test_world#L157 to
fail.


Attachments:

  [text/x-patch] v25-0007-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 2-v25-0007-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From d2430256e4ad3638b44c8e6100daf0a6866434e3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v25 07/12] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables using tuplestore for non-tuple data (TIDs) in the next commit.
---
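As a rough illustration of the on-tape layout described in the commit
message, here is a standalone sketch that does not use any PostgreSQL
headers. The Tape struct and the put_value()/get_value() helpers are
hypothetical names invented for this example; they stand in for the BufFile
and the writetup/readtup callbacks. The sketch assumes the length word
counts only the payload bytes and ignores padding, which may differ from
what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the BufFile in tuplestore.c. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/*
 * Write one value.
 * typlen > 0: fixed-length type, raw bytes only, no length words.
 * typlen < 0: variable-length type, leading length word, plus a trailing
 *             copy of it when backward scans are enabled.
 */
void
put_value(Tape *t, int typlen, bool backward,
		  const void *data, unsigned int datalen)
{
	if (typlen > 0)
	{
		tape_write(t, data, (size_t) typlen);
		return;
	}
	tape_write(t, &datalen, sizeof(datalen));
	tape_write(t, data, datalen);
	if (backward)
		tape_write(t, &datalen, sizeof(datalen));
}

/* Read one value into dst; returns payload bytes read, or 0 at end of tape. */
unsigned int
get_value(Tape *t, int typlen, bool backward, void *dst, size_t dstsize)
{
	unsigned int len;

	if (t->rpos == t->wpos)
		return 0;

	/* Fixed-length types take the length from the type, not from the tape. */
	if (typlen > 0)
		len = (unsigned int) typlen;
	else
		tape_read(t, &len, sizeof(len));

	assert(len <= dstsize);
	tape_read(t, dst, len);

	/* Skip the trailing length word, if one was written. */
	if (typlen < 0 && backward)
	{
		unsigned int trailer;

		tape_read(t, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

With this layout an 8-byte fixed-length value occupies exactly 8 bytes on
tape, while a 3-byte variable-length value written with backward scans
enabled occupies 3 bytes plus two length words.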
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple on tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int(*lentup)(Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, the first "unsigned int" of a stored tuple is the total
+ * size on-tape of the tuple, including itself (so it is never zero).
+ * The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For Datums of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for constant-size Datums);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.48.1



  [text/x-patch] v25-0012-Remove-PROC_IN_SAFE_IC-optimization.patch (21.2K, 3-v25-0012-Remove-PROC_IN_SAFE_IC-optimization.patch)
  download | inline diff:
From 43596b70d9e4ba0ec0e6b93a0bd8e888deefaf1a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:24:48 +0100
Subject: [PATCH v25 12/12] Remove PROC_IN_SAFE_IC optimization

This optimization allowed concurrent index builds to ignore other concurrent builds of indexes that have no expressions or predicates. With the new snapshot handling approach that periodically refreshes snapshots, this optimization is no longer necessary.

The change simplifies concurrent index build code by:
- removing the PROC_IN_SAFE_IC process status flag
- eliminating set_indexsafe_procflags() calls and related logic
- removing special case handling in GetCurrentVirtualXIDs()
- removing related test cases and injection points
---
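To make the proc.h part of this change concrete, the sketch below restates
the status flag masks as they look after the patch and shows how an
end-of-transaction reset treats them. The flag values are copied from the
diff; reset_at_eoxact() is a hypothetical helper written for this example,
not a PostgreSQL function.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as they appear in src/include/storage/proc.h after the patch. */
#define PROC_IS_AUTOVACUUM			0x01
#define PROC_IN_VACUUM				0x02
/* 0x04 was PROC_IN_SAFE_IC; the bit is unused once the flag is removed. */
#define PROC_VACUUM_FOR_WRAPAROUND	0x08

/* flags reset at EOXact */
#define PROC_VACUUM_STATE_MASK \
	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)

/* flags that affect how a process' xmin is interpreted */
#define PROC_XMIN_FLAGS (PROC_IN_VACUUM)

/* Hypothetical helper: clear the per-transaction flags, keep the rest. */
uint8_t
reset_at_eoxact(uint8_t flags)
{
	return (uint8_t) (flags & ~PROC_VACUUM_STATE_MASK);
}
```

Since no mask references bit 0x04 any more, parallel workers can assert
that their statusFlags are zero, as the brin.c and gininsert.c hunks do.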
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/gin/gininsert.c            |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 9 files changed, 13 insertions(+), 237 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5554cfa6f4d..cebcb777ef3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2893,11 +2893,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bf26106aa5e..829ecb4ed41 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2106,11 +2106,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 4f936a6cd98..f4ea4cce04d 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1911,11 +1911,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flags can be set for a parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 95d9ba57324..2480c6e8cf0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -114,7 +114,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -417,10 +416,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -441,8 +437,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -462,8 +457,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -577,7 +571,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1181,10 +1174,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1670,10 +1659,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1728,9 +1713,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1760,10 +1742,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1789,9 +1767,7 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
+
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
@@ -1808,9 +1784,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
@@ -1851,10 +1824,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	/* Now wait for all transaction to see auxiliary as "non-ready for inserts" */
@@ -1875,10 +1844,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	/* Now wait for all transaction to ignore auxiliary because it is dead */
@@ -3653,7 +3618,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -4027,17 +3991,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe", NULL);
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe", NULL);
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4103,7 +4056,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
 		newidx->junkAuxIndexId = junkAuxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4204,11 +4156,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4240,10 +4187,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
 
@@ -4252,11 +4195,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4281,10 +4219,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4304,11 +4238,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4330,10 +4259,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4369,10 +4294,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4400,9 +4321,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4424,13 +4342,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4486,12 +4397,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4555,12 +4460,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4828,36 +4727,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c6f5ebceefd..f47d268d6c7 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f4a62ed1ca7..b217b1aa951 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
+REGRESS = injection_points hashagg vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index ba7bc0cc384..7feaf05129c 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -36,7 +36,6 @@ tests += {
     'sql': [
       'injection_points',
       'hashagg',
-      'reindex_conc',
       'vacuum',
       'cic_reset_snapshots',
     ],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.48.1



  [text/x-patch] v25-0011-Refresh-snapshot-periodically-during-index-valid.patch (23.5K, 4-v25-0011-Refresh-snapshot-periodically-during-index-valid.patch)
From e78fcf7cfa08cc7d86c199067b1da7d96042a2bf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:18:32 +0200
Subject: [PATCH v25 11/12] Refresh snapshot periodically during index
 validation

Enhances validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
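
A rough, self-contained sketch of the refresh cadence that heapam_index_validate_scan() follows in this patch: the page counter starts at 1, a new snapshot is taken whenever the counter reaches a multiple of the reset interval, and the limit xmin is kept at the newest value seen. Xid, SnapshotSource, take_snapshot() and validate_scan() are hypothetical stand-ins, not PostgreSQL APIs. xid_newer() is only my assumption of what a wraparound-aware TransactionIdNewer() could look like, modeled on the modulo-2^32 comparison that TransactionIdFollows() uses for normal XIDs.

```c
#include <stdint.h>

/* Same interval as VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE in the patch. */
#define RESET_SNAPSHOT_EACH_N_PAGE 4096

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* Wraparound-aware "a is logically later than b" for normal XIDs. */
int
xid_follows(Xid a, Xid b)
{
	return (int32_t) (a - b) > 0;
}

/* Assumed semantics of TransactionIdNewer(): return the later of the two. */
Xid
xid_newer(Xid a, Xid b)
{
	return xid_follows(a, b) ? a : b;
}

/* Toy snapshot source: every new snapshot reports an xmin advanced by step. */
typedef struct SnapshotSource
{
	Xid			next_xmin;
	Xid			step;
	int			taken;			/* number of snapshots handed out */
} SnapshotSource;

Xid
take_snapshot(SnapshotSource *src)
{
	Xid			xmin = src->next_xmin;

	src->next_xmin += src->step;
	src->taken++;
	return xmin;
}

/* Scan npages pages, refreshing the snapshot periodically; return limit xmin. */
Xid
validate_scan(SnapshotSource *src, int npages)
{
	int			page_read_counter = 1;	/* 1 skips a reset at the very start */
	Xid			limit_xmin = take_snapshot(src);

	for (int page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked against the snapshot ... */

		if (page_read_counter % RESET_SNAPSHOT_EACH_N_PAGE == 0)
		{
			/* Drop the old snapshot, take a new one, keep the newest xmin. */
			Xid			fresh_xmin = take_snapshot(src);

			limit_xmin = xid_newer(limit_xmin, fresh_xmin);
		}
	}
	return limit_xmin;
}
```

With a source that starts at xmin 100 and advances by 10 per snapshot, validate_scan(&src, 10000) takes three snapshots (the initial one plus refreshes when the counter reaches 4096 and 8192) and returns 120. The returned value is what the caller would then pass to WaitForOlderSnapshots().
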
 doc/src/sgml/ref/create_index.sgml       | 11 +++-
 doc/src/sgml/ref/reindex.sgml            | 11 ++--
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 15 ++---
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 11 files changed, 150 insertions(+), 87 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index cf14f474946..1626cee7a03 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -881,9 +881,14 @@ Indexes:
   </para>
 
   <para>
-   Like any long-running transaction, <command>CREATE INDEX</command> on a
-   table can affect which tuples can be removed by concurrent
-   <command>VACUUM</command> on any other table.
+   Due to the improved implementation using periodically refreshed snapshots and
+   auxiliary indexes, concurrent index builds have minimal impact on concurrent
+   <command>VACUUM</command> operations. The system automatically advances its
+   internal transaction horizon during the build process, allowing
+   <command>VACUUM</command> to remove dead tuples on other tables without
+   having to wait for the entire index build to complete. Only during very brief
+   periods when snapshots are being refreshed might there be any temporary effect
+   on concurrent <command>VACUUM</command> operations.
   </para>
 
   <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index d62791ff9c3..60f4d0d680f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -502,10 +502,13 @@ Indexes:
    </para>
 
    <para>
-    Like any long-running transaction, <command>REINDEX</command> on a table
-    can affect which tuples can be removed by concurrent
-    <command>VACUUM</command> on any other table.
-   </para>
+    <command>REINDEX CONCURRENTLY</command> has minimal
+    impact on which tuples can be removed by concurrent <command>VACUUM</command>
+    operations on other tables. This is achieved through periodic snapshot
+    refreshes and the use of auxiliary indexes during the rebuild process,
+    allowing the system to advance its transaction horizon regularly rather than
+    maintaining a single long-running snapshot.
+  </para>
 
    <para>
     <command>REINDEX SYSTEM</command> does not support
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 6f718feb6d5..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ use the key value from the live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3cac122f7a7..409852f23e2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2034,23 +2034,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2061,14 +2064,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2084,6 +2089,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so
+	 * often to let the xmin horizon advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2117,6 +2145,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2172,6 +2201,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2181,9 +2224,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ee94ab509e7..4f936a6cd98 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -445,7 +445,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than the latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 8f8a1ad7796..d57485cefc2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 75607a34cf2..7d106a3d233 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3536,8 +3536,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3550,7 +3551,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3571,13 +3572,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3627,8 +3629,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3664,6 +3670,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have acquired a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3698,6 +3707,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have acquired a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3717,19 +3729,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have acquired a catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have acquired a catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3752,6 +3769,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 7b3d4b19288..95d9ba57324 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1793,32 +1792,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1840,8 +1818,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last validation snapshot was taken, we have to wait out any
+	 * transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4381,7 +4359,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4396,13 +4373,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4414,16 +4384,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4436,7 +4398,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5bc16f07a86..66d5dfb96d6 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,12 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1834,18 +1833,16 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
 	return table_rel->rd_tableam->index_validate_scan(table_rel,
 													  index_rel,
 													  index_info,
-													  snapshot,
 													  state,
 													  auxstate);
 }
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 7d82cd2eb56..15e345c7a19 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index d51b4e8cd13..6c780681967 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.48.1
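
Editorial sketch, not part of the patch: the TransactionIdNewer() helper added to transam.h above picks the later of two XIDs and treats an invalid XID as "no value yet". A minimal standalone model of that logic, plus one way a caller might fold the xmin of each fresh snapshot into a single limitXmin, could look like this. The names xid_t, xid_follows, xid_newer and newest_xmin are hypothetical stand-ins, the circular comparison is a simplified model of TransactionIdFollows(), and whether the table AM folds xmins exactly this way is an assumption, since that hunk is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and its special values. */
typedef uint32_t xid_t;
#define XID_INVALID      ((xid_t) 0)
#define XID_FIRST_NORMAL ((xid_t) 3)

static bool
xid_is_valid(xid_t x)
{
	return x != XID_INVALID;
}

static bool
xid_is_normal(xid_t x)
{
	return x >= XID_FIRST_NORMAL;
}

/* Simplified model: does "a" logically follow "b" in circular XID space? */
static bool
xid_follows(xid_t a, xid_t b)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a > b;
	/* Normal XIDs are compared modulo 2^32, so wraparound is handled. */
	return (int32_t) (a - b) > 0;
}

/* Mirrors the shape of TransactionIdNewer(): an invalid input loses. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/* Assumed usage: keep the newest xmin seen across a series of snapshots. */
static xid_t
newest_xmin(const xid_t *xmins, size_t n)
{
	xid_t		limit = XID_INVALID;

	for (size_t i = 0; i < n; i++)
		limit = xid_newer(limit, xmins[i]);
	return limit;
}
```

In this model, newest_xmin() over {4294967290, 10, 100} yields 100: the comparison is circular, so 10 counts as newer than 4294967290 once the counter has wrapped.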



  [text/x-patch] v25-0010-Optimize-auxiliary-index-handling.patch (2.4K, 5-v25-0010-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 75c1c8476e477229f1ba040ec2b3fcaad7d52db5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v25 10/12] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes:
- FormIndexDatum() returns all-NULL values for an auxiliary index instead of computing Datum values
- ExecInsertIndexTuples() always passes indexUnchanged=false for an auxiliary index, skipping the index_unchanged_by_update() check

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b95dda427c6..75607a34cf2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2934,6 +2934,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 0edf54e852d..09b9b811def 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.48.1
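
Editorial sketch, not part of the patch: the two shortcuts in v25-0010 above can be modelled outside the server to show what is skipped. For an auxiliary index the expensive per-attribute computation never runs, and the "unchanged by update" check is short-circuited before it is called. All names here (datum_t, index_info_t, compute_attr, unchanged_by_update and the two counters) are hypothetical and exist only for illustration; the real code paths are FormIndexDatum() and ExecInsertIndexTuples().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t datum_t;		/* hypothetical stand-in for Datum */

typedef struct
{
	int			num_attrs;
	bool		auxiliary;		/* models IndexInfo.ii_Auxiliary */
} index_info_t;

static int	compute_calls = 0;	/* times the expensive value path ran */
static int	check_calls = 0;	/* times the unchanged-check ran */

/* Stand-in for evaluating a column or index expression. */
static datum_t
compute_attr(int attno)
{
	compute_calls++;
	return (datum_t) ((attno + 1) * 10);
}

/* Stand-in for index_unchanged_by_update(). */
static bool
unchanged_by_update(void)
{
	check_calls++;
	return true;
}

/* Models the early return: an auxiliary index gets all-NULL values. */
static void
form_index_datum(const index_info_t *info, datum_t *values, bool *isnull)
{
	if (info->auxiliary)
	{
		for (int i = 0; i < info->num_attrs; i++)
		{
			values[i] = (datum_t) 0;
			isnull[i] = true;
		}
		return;
	}

	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = compute_attr(i);
		isnull[i] = false;
	}
}

/* Models the hint: && short-circuits, so the check is never called. */
static bool
index_unchanged_hint(const index_info_t *info, bool is_update)
{
	return is_update && !info->auxiliary && unchanged_by_update();
}
```

The point of placing the auxiliary test before the function call in the && chain is that the check is not merely overridden but never executed, which is where the saving comes from.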



  [text/x-patch] v25-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 6-v25-0009-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 1b9cd8296f14a5ea752a0e1b2eb1c897e2b783c8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v25 09/12] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to makeIndexInfo() so that index_create() can record the dependency of an auxiliary index on its main index
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 30db079c8d8..cf14f474946 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>, the
+    recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 4ed3c969012..d62791ff9c3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 283c87f4327..b95dda427c6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY are required both or neither */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2634,7 +2647,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2695,7 +2709,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3870,6 +3885,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3926,6 +3942,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4214,7 +4243,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4303,13 +4333,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4335,18 +4382,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index from a relation
+		 * with relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 8c721e20992..7b3d4b19288 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3671,6 +3672,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4020,6 +4022,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4027,6 +4030,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4100,12 +4104,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4115,6 +4124,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4136,10 +4146,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4320,7 +4338,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4343,6 +4362,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4561,6 +4583,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4612,6 +4636,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index fc89352b661..0cc88d3064f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1533,6 +1533,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1593,9 +1595,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1647,6 +1660,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop is required to be the first transaction. But
+		 * if a junk auxiliary index exists, we want to drop it too (also
+		 * concurrently). In this case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1675,12 +1716,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1cd036a0594..53e15502ec1 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index a3e85ba1310..85cd088d080 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.48.1



  [text/x-patch] v25-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.4K, 7-v25-0008-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
From f775a6a15aa6f1f400442bef08ea0c2722802407 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v25 08/12] Use auxiliary indexes for concurrent index
 operations

Replace the second table full scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
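
For illustration only, the two-pointer difference step described above can be sketched in standalone C. This is a minimal sketch, not the patch's code: it assumes both TID lists are already sorted ascending and models each TID as a plain uint64_t key instead of ItemPointerData. The name tid_difference and its signature are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch.
 *
 * TIDs are modelled as uint64_t keys (for example block number in the high
 * bits, offset in the low bits).  Both inputs must be sorted ascending.
 * Every element of aux[] that is absent from main_tids[] is written to
 * out[], which must have room for naux entries.  Duplicates in aux[] are
 * emitted as-is.  Returns the number of entries written.
 */
static size_t
tid_difference(const uint64_t *main_tids, size_t nmain,
			   const uint64_t *aux, size_t naux,
			   uint64_t *out)
{
	size_t		i = 0;			/* cursor into main_tids[] */
	size_t		j = 0;			/* cursor into aux[] */
	size_t		n = 0;			/* entries written to out[] */

	assert(out != NULL || naux == 0);

	while (j < naux)
	{
		/* Advance the main cursor past everything smaller than aux[j]. */
		while (i < nmain && main_tids[i] < aux[j])
			i++;

		/* Absent from main: this TID still has to be inserted there. */
		if (i == nmain || main_tids[i] != aux[j])
			out[n++] = aux[j];

		j++;
	}

	return n;
}
```

Each cursor only moves forward, so the step costs one linear pass over both sorted inputs; in the patch the same role is played by the two tuplesorts and the tuplestore.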
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 545 +++++++++++++--------
 src/backend/catalog/index.c                | 313 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 334 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  20 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1116 insertions(+), 339 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3f4a27a736e..5c48e529e4a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6327,6 +6327,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6367,13 +6379,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6390,8 +6401,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index b9c679c41e8..30db079c8d8 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c4055397146..4ed3c969012 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..6f718feb6d5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 42748c01a49..3cac122f7a7 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1781,243 +1782,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation, store = aux - main, or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation, which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9b09d052b0c..283c87f4327 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2470,7 +2633,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2530,7 +2694,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3307,12 +3472,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan; as a result, we have an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3322,18 +3496,21 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
  * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3341,12 +3518,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that we need to do bulkdelete in
+ * a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3364,22 +3543,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index and
+	 * 10% for sorting the auxiliary index.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3412,6 +3595,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3436,15 +3620,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3467,27 +3691,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3496,6 +3723,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3556,6 +3784,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3827,6 +4060,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4069,6 +4309,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4094,6 +4335,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c77fa0234bb..88d94e7ced9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1291,16 +1291,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dda1eb0e94c..8c721e20992 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,38 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions left that see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We build the index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1721,9 +1771,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1741,24 +1810,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1785,7 +1844,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1810,6 +1869,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3564,6 +3670,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3669,8 +3776,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3722,8 +3836,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3784,6 +3905,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3887,15 +4015,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3946,6 +4077,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3959,12 +4095,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3973,6 +4114,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3991,10 +4133,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4075,13 +4221,56 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index; this is fast, since it creates an empty index without any actual heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4124,6 +4313,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4131,12 +4355,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4174,7 +4392,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4203,7 +4421,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4293,14 +4511,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4325,6 +4543,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4338,11 +4578,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4362,6 +4602,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8f5aa0d7146..5bc16f07a86 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -718,7 +718,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1829,19 +1830,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  snapshot,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 4daa8bef5ee..d51b4e8cd13 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 98e68e972be..a3e85ba1310 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 35e8aad7701..ae3bfc3688e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2050,14 +2050,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.48.1
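
For readers following the phase renumbering in progress.h and rules.out above, here is a minimal standalone sketch of the new number-to-label mapping. It is not part of the patch: createidx_phase_name() is a hypothetical helper written only for illustration, the labels are copied from the CASE expression in the rules.out hunk, and the "building index" label omits the access-method subphase suffix that the view appends.

```c
#include <stddef.h>

/* Phase numbers as renumbered by the progress.h hunk above. */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/*
 * Hypothetical helper (illustration only, not a PostgreSQL function):
 * returns the label the pg_stat_progress_create_index view would show
 * for a phase number, or NULL for an unknown phase.
 */
static const char *
createidx_phase_name(int phase)
{
	switch (phase)
	{
		case 0:
			return "initializing";
		case PROGRESS_CREATEIDX_PHASE_WAIT_1:
			return "waiting for writers before build";
		case PROGRESS_CREATEIDX_PHASE_WAIT_2:
			return "waiting for writers to use auxiliary index";
		case PROGRESS_CREATEIDX_PHASE_BUILD:
			return "building index";
		case PROGRESS_CREATEIDX_PHASE_WAIT_3:
			return "waiting for writers before validation";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN:
			return "index validation: scanning index";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT:
			return "index validation: sorting tuples";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE:
			return "index validation: merging indexes";
		case PROGRESS_CREATEIDX_PHASE_WAIT_4:
			return "waiting for old snapshots";
		case PROGRESS_CREATEIDX_PHASE_WAIT_5:
			return "waiting for readers before marking dead";
		case PROGRESS_CREATEIDX_PHASE_WAIT_6:
			return "waiting for readers before dropping";
		default:
			return NULL;
	}
}
```

Note that phase 2 is a new wait inserted before the build, the old "scanning table" validation step at 6 is replaced by "merging indexes" at 7, and every later wait moves up by one, which is why the REINDEX CONCURRENTLY hunks bump WAIT_3/4/5 to WAIT_4/5/6.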



  [text/x-patch] v25-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch (41.2K, 8-v25-0004-Support-snapshot-resets-in-parallel-concurrent-i.patch)
  download | inline diff:
From a63d3b74d04e6904b12d998a1ef3b1887f6b2d64 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Wed, 1 Jan 2025 15:25:20 +0100
Subject: [PATCH v25 04/12] Support snapshot resets in parallel concurrent
 index builds

Extend periodic snapshot reset support to parallel builds, previously limited to non-parallel operations. This allows the xmin horizon to advance during parallel concurrent index builds as well.

The main obstacle to applying that technique to parallel builds was the requirement to wait until worker processes restore their initial snapshot from the leader.

To address this, the following changes are applied:
- add infrastructure to track snapshot restoration in parallel workers
- extend parallel scan initialization to support periodic snapshot resets
- wait for parallel workers to restore their initial snapshots before proceeding with scan
- relax limitation for parallel worker to call GetLatestSnapshot
---
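
The commit message above says that snapshot resets let the xmin horizon advance, and that in a parallel build every participant has to take part. The toy model below is a sketch of that reasoning only, under loud assumptions: ToyXid, held_xmin_after_scan() and oldest_held_xmin() are hypothetical names, transaction IDs are modelled as a counter that grows by a fixed amount per scanned page, and a snapshot's xmin is modelled as the counter value at the moment the snapshot was taken. None of this is PostgreSQL code.

```c
#include <stddef.h>

typedef unsigned int ToyXid;	/* stand-in for TransactionId (assumption) */

/*
 * Toy model of one scan participant.  It scans npages pages; while each
 * page is processed, xids_per_page new transactions start elsewhere.
 * If reset_every > 0, the participant replaces its snapshot with a fresh
 * one every reset_every pages; otherwise it keeps the initial snapshot.
 * Returns the xmin of the snapshot it holds at the end of the scan.
 */
static ToyXid
held_xmin_after_scan(int npages, int reset_every,
					 ToyXid start_xid, ToyXid xids_per_page)
{
	ToyXid		next_xid = start_xid;
	ToyXid		held_xmin = start_xid;
	int			page;

	for (page = 0; page < npages; page++)
	{
		/* periodic reset: drop the old snapshot and take a fresh one */
		if (reset_every > 0 && page > 0 && page % reset_every == 0)
			held_xmin = next_xid;

		/* concurrent activity while this page is being processed */
		next_xid += xids_per_page;
	}

	return held_xmin;
}

/*
 * The horizon is pinned by the oldest snapshot held by any participant,
 * so a single participant that never resets holds everyone back.
 */
static ToyXid
oldest_held_xmin(const int *reset_every, size_t nparticipants,
				 int npages, ToyXid start_xid, ToyXid xids_per_page)
{
	ToyXid		oldest = start_xid;
	size_t		i;

	for (i = 0; i < nparticipants; i++)
	{
		ToyXid		x = held_xmin_after_scan(npages, reset_every[i],
											 start_xid, xids_per_page);

		if (i == 0 || x < oldest)
			oldest = x;
	}

	return oldest;
}
```

In this model, three participants that all reset every 100 pages end a 1000-page scan holding a snapshot far newer than the starting one, while the same group with one participant that never resets stays pinned at the starting value. That is the intuition behind having the leader wait until the workers have restored their initial snapshots before anyone begins resetting.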
 src/backend/access/brin/brin.c                | 50 +++++++++-------
 src/backend/access/gin/gininsert.c            | 50 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 ++--
 src/backend/access/nbtree/nbtsort.c           | 57 ++++++++++++++-----
 src/backend/access/table/tableam.c            | 37 ++++++++++--
 src/backend/access/transam/parallel.c         | 50 ++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 +--
 .../expected/cic_reset_snapshots.out          | 25 +++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 14 files changed, 225 insertions(+), 89 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 186edd0d229..5554cfa6f4d 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -1221,7 +1220,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1254,7 +1252,6 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -1269,6 +1266,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = idxtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -2368,7 +2366,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2399,25 +2396,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2457,8 +2454,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2483,7 +2478,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2529,7 +2525,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2545,6 +2540,13 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2553,7 +2555,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2576,9 +2579,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2778,14 +2778,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -2807,6 +2807,7 @@ _brin_leader_participate_as_worker(BrinBuildState *buildstate, Relation heap, Re
 	/* Perform work common to all participants */
 	_brin_parallel_scan_and_build(buildstate, brinleader->brinshared,
 								  brinleader->sharedsort, heap, index, sortmem, true);
+	Assert(!brinleader->brinshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2947,6 +2948,13 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_brin_parallel_scan_and_build(buildstate, brinshared, sharedsort,
 								  heapRel, indexRel, sortmem, false);
+	if (brinshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2f947d36619..bf26106aa5e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -132,7 +132,6 @@ typedef struct GinLeader
 	 */
 	GinBuildShared *ginshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } GinLeader;
@@ -180,7 +179,7 @@ typedef struct
 static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 								bool isconcurrent, int request);
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
-static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _gin_parallel_estimate_shared(Relation heap);
 static double _gin_parallel_heapscan(GinBuildState *state);
 static double _gin_parallel_merge(GinBuildState *state);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
@@ -717,7 +716,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -741,7 +739,6 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
-		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -771,6 +768,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
@@ -905,7 +903,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estginshared;
 	Size		estsort;
 	GinBuildShared *ginshared;
@@ -935,25 +932,25 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
 	 */
-	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	estginshared = _gin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -993,8 +990,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -1018,7 +1013,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1060,7 +1056,6 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 		ginleader->nparticipanttuplesorts++;
 	ginleader->ginshared = ginshared;
 	ginleader->sharedsort = sharedsort;
-	ginleader->snapshot = snapshot;
 	ginleader->walusage = walusage;
 	ginleader->bufferusage = bufferusage;
 
@@ -1076,6 +1071,13 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = ginleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically, so we need to
+	 * wait until all workers have imported the initial snapshot.
+	 */
+	if (isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_gin_leader_participate_as_worker(buildstate, heap, index);
@@ -1084,7 +1086,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!isconcurrent)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1107,9 +1110,6 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(ginleader->snapshot))
-		UnregisterSnapshot(ginleader->snapshot);
 	DestroyParallelContext(ginleader->pcxt);
 	ExitParallelMode();
 }
@@ -1790,14 +1790,14 @@ _gin_parallel_merge(GinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * gin index build based on the snapshot its parallel scan will use.
+ * gin index build.
  */
 static Size
-_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_gin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
@@ -1820,6 +1820,7 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
 								 ginleader->sharedsort, heap, index,
 								 sortmem, true);
+	Assert(!ginleader->ginshared->isconcurrent || !TransactionIdIsValid(MyProc->xid));
 }
 
 /*
@@ -2179,6 +2180,13 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
+	if (ginshared->isconcurrent)
+	{
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+		Assert(!TransactionIdIsValid(MyProc->xid));
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e32ee739733..a7e16871af6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1235,14 +1235,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that, while that snapshot is reset
+	 * every so often (in case of a non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1304,8 +1303,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7b09ad878b7..53b7ddfff0e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -322,22 +322,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -486,8 +484,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
-		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -1421,6 +1418,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1438,12 +1436,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1451,6 +1458,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1511,7 +1523,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1538,7 +1550,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1614,6 +1627,13 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots are going to be reset periodically.
+	 * Wait until all workers have imported the initial snapshot.
+	 */
+	if (reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1622,7 +1642,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!reset_snapshot)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1646,7 +1667,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
@@ -1896,6 +1917,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	SortCoordinate coordinate;
 	BTBuildState buildstate;
 	TableScanDesc scan;
+	ParallelTableScanDesc pscan;
 	double		reltuples;
 	IndexInfo  *indexInfo;
 
@@ -1950,11 +1972,15 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(btspool->index);
 	indexInfo->ii_Concurrent = btshared->isconcurrent;
-	scan = table_beginscan_parallel(btspool->heap,
-									ParallelTableScanFromBTShared(btshared));
+	pscan = ParallelTableScanFromBTShared(btshared);
+	scan = table_beginscan_parallel(btspool->heap, pscan);
 	reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo,
 									   true, progress, _bt_build_callback,
 									   &buildstate, scan);
+	InvalidateCatalogSnapshot();
+	if (pscan->phs_reset_snapshot)
+		PopActiveSnapshot();
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Execute this worker's part of the sort */
 	if (progress)
@@ -1990,4 +2016,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 	tuplesort_end(btspool->sortstate);
 	if (btspool2)
 		tuplesort_end(btspool2->sortstate);
+	Assert(!pscan->phs_reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	if (pscan->phs_reset_snapshot)
+		PushActiveSnapshot(GetTransactionSnapshot());
 }
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..8b33b6278ce 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -132,10 +132,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -144,21 +144,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize", NULL);
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -171,7 +186,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, set SO_RESET_SNAPSHOT
+	 * in the scan flags and use the currently active snapshot as the
+	 * initial snapshot.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..065ea9d26f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -77,6 +77,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -305,6 +306,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -376,6 +381,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -491,6 +497,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate per-worker flags in dynamic shared memory so that workers
+		 * can report that they have restored their snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -546,6 +565,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -661,6 +691,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker is considered attached only once it
+ * has also restored its snapshot. This is needed with periodic snapshot resets
+ * so that all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -690,7 +724,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -734,9 +768,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1295,6 +1332,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1499,6 +1537,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* The snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f50221930fd..32b7e6311eb 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1533,7 +1533,7 @@ index_concurrently_build(Oid heapRelationId,
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..f16284d4d0d 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -371,7 +371,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 8e1a918f130..68ea98405bb 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -353,14 +353,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index f37be6d5690..a7362f7b43b 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index b5e0fb386c0..50441c58cea 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 71af14d1c31..613615c78cd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1140,7 +1140,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1762,9 +1763,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * In case of a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied
+ * to the scan. That causes snapshots to be changed on the fly, allowing the
+ * xmin horizon to propagate.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 948d1232aa0..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,30 +78,45 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.48.1
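
Illustration only, not part of either patch: the wait_for_snapshot check that v16-0009 adds to WaitForParallelWorkersToAttach() can be sketched in isolation. DemoWorker and demo_count_ready are hypothetical names invented for this sketch, and the real function also polls worker status and handles failures. Only the condition "!wait_for_snapshot || snapshot_restored" is taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-worker state the leader inspects. */
typedef struct DemoWorker
{
	bool		attached;			/* worker attached to its error queue */
	bool		snapshot_restored;	/* worker restored its initial snapshot */
} DemoWorker;

/*
 * Count the workers the leader may treat as "known attached".
 *
 * When wait_for_snapshot is true, an attached worker is counted only once it
 * has also reported that its snapshot is restored.
 */
int
demo_count_ready(const DemoWorker *workers, size_t nworkers,
				 bool wait_for_snapshot)
{
	int			ready = 0;

	assert(workers != NULL || nworkers == 0);

	for (size_t i = 0; i < nworkers; i++)
	{
		if (!workers[i].attached)
			continue;
		if (!wait_for_snapshot || workers[i].snapshot_restored)
			ready++;
	}
	return ready;
}
```

With wait_for_snapshot set, a worker that has attached but not yet restored its snapshot is not counted, so a leader looping on this check would keep waiting before it starts resetting snapshots.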



  [text/x-patch] v25-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 9-v25-0006-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 07df45a6d2ec1301f38b6bc71de0339af5fba979 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v25 06/12] Add STIR access method and flags related to
 auxiliary indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds by adding:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 331b4f2b916..d3451078176 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 981d9380a92..d0276bf483b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3087,6 +3087,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3138,6 +3139,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0669407be0c..9b09d052b0c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3434,6 +3434,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 12b4f3fd36e..b747c6e7804 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ac62f6a6abd..0d0a0f8d73f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -75,6 +75,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 01eba3b5a19..0d29115f200 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..1cd036a0594 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 20bf9ea9cdf..fc116b84a28 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index a79325e8a2f..8e7c9de12bb 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5139,7 +5139,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5153,7 +5154,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5178,9 +5180,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5189,12 +5191,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5203,7 +5206,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.48.1



  [text/x-patch] v25-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch (46.2K, 10-v25-0003-Reset-snapshots-periodically-in-non-unique-non-p.patch)
  download | inline diff:
From f4871ccfe895664873e4788b9e28e164793186a3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 21:10:23 +0100
Subject: [PATCH v25 03/12] Reset snapshots periodically in non-unique
 non-parallel concurrent index builds

Long-lived snapshots used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon. Commit d9d076222f5b attempted to allow VACUUM to ignore such snapshots to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during index creation, leading to index corruption.

This patch introduces an alternative: periodically resetting the snapshot used during the first phase. Resetting the snapshot every N pages during the heap scan allows the xmin horizon to advance.

Currently, this technique is applied only to:

- the first scan of the heap: the second scan, during index validation, still uses a single snapshot to ensure index correctness
- non-parallel index builds: parallel index builds are not yet supported and will be addressed in following commits
- non-unique indexes: unique index builds still require a consistent snapshot to enforce uniqueness constraints; this will be addressed in following commits

A new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset "between" every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.
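
As a rough illustration of the reset cadence described above (not code from this patch), the standalone C sketch below counts how many snapshot resets a sequential scan would trigger. It mirrors the block-number modulo check that the patch adds to heap_fetch_next_buffer(). The helper names and the value of RESET_EACH_N_PAGE are assumptions made for the example; the real SO_RESET_SNAPSHOT_EACH_N_PAGE is defined in tableam.h, which is not shown in this message.

```c
#include <stdint.h>

/*
 * Assumed value, for illustration only.  The patch defines the real
 * SO_RESET_SNAPSHOT_EACH_N_PAGE in tableam.h.
 */
#define RESET_EACH_N_PAGE 4096u

/* Hypothetical helper: does reading this block trigger a snapshot reset? */
static int
block_triggers_reset(uint32_t blkno)
{
	return blkno % RESET_EACH_N_PAGE == 0;
}

/* Hypothetical helper: count resets for a scan of blocks [0, nblocks). */
static uint32_t
count_snapshot_resets(uint32_t nblocks)
{
	uint32_t	resets = 0;

	for (uint32_t blkno = 0; blkno < nblocks; blkno++)
	{
		if (block_triggers_reset(blkno))
			resets++;
	}
	return resets;
}
```

Under the assumed constant, a scan of 10000 blocks starting at block zero would reset the snapshot at blocks 0, 4096 and 8192. No single snapshot would then be held for more than 4096 pages.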
---
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  19 +++-
 src/backend/access/gin/gininsert.c            |  21 ++++
 src/backend/access/gist/gistbuild.c           |   3 +
 src/backend/access/hash/hash.c                |   1 +
 src/backend/access/heap/heapam.c              |  45 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  30 ++++-
 src/backend/access/spgist/spginsert.c         |   2 +
 src/backend/catalog/index.c                   |  31 +++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/heapam.h                   |   2 +
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 105 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 20 files changed, 428 insertions(+), 35 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 2445f001700..25a32a13565 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -558,7 +558,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index b5de68b7232..331b4f2b916 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -335,7 +335,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7ff7467e462..186edd0d229 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1216,11 +1216,12 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		state->bs_sortstate =
 			tuplesort_begin_index_brin(maintenance_work_mem, coordinate,
 									   TUPLESORT_NONE);
-
+		InvalidateCatalogSnapshot();
 		/* scan the relation and merge per-worker results */
 		reltuples = _brin_parallel_merge(state);
 
 		_brin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -1233,6 +1234,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   brinbuildCallback, state, NULL);
 
+		InvalidateCatalogSnapshot();
 		/*
 		 * process the final batch
 		 *
@@ -1252,6 +1254,7 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		brin_fill_empty_ranges(state,
 							   state->bs_currRangeStart,
 							   state->bs_maxRangeStart);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	/* release resources */
@@ -2374,6 +2377,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2399,9 +2403,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2444,6 +2455,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2523,6 +2536,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2539,6 +2554,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e9d4b27427e..2f947d36619 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -28,6 +28,7 @@
 #include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "tcop/tcopprot.h"
 #include "utils/datum.h"
 #include "utils/memutils.h"
@@ -646,6 +647,8 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_ParallelWorkers || !TransactionIdIsValid(MyProc->xid));
+
 	/* Report table scan phase started */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
@@ -708,11 +711,13 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			tuplesort_begin_index_gin(heap, index,
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
+		InvalidateCatalogSnapshot();
 
 		/* scan the relation in parallel and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
 
 		_gin_end_parallel(state->bs_leader, state);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 	else						/* no parallel index build */
 	{
@@ -722,6 +727,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
 										   ginBuildCallback, &buildstate, NULL);
+		InvalidateCatalogSnapshot();
 
 		/* dump remaining entries to the index */
 		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -735,6 +741,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 						   list, nlist, &buildstate.buildStats);
 		}
 		MemoryContextSwitchTo(oldCtx);
+		Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 	}
 
 	MemoryContextDelete(buildstate.funcCtx);
@@ -907,6 +914,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -931,9 +939,16 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
@@ -976,6 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1050,6 +1067,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_gin_end_parallel(ginleader, NULL);
 		return;
 	}
@@ -1066,6 +1085,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9b2ec9815f1..bfc27474433 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -43,6 +43,7 @@
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -259,6 +260,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 	if (buildstate.buildMode == GIST_SORTED_BUILD)
 	{
 		/*
@@ -350,6 +352,7 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = (double) buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 53061c819fb..3711baea052 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -197,6 +197,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xid));
 
 	return result;
 }
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..d73968475c0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/spccache.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -633,6 +634,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#if USE_INJECTION_POINTS
+	/* The horizon still cannot advance if an xid has been assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective", NULL);
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -674,7 +705,12 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1336,6 +1372,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..e32ee739733 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1194,6 +1194,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1228,9 +1230,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1240,6 +1239,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * For a unique index we need a consistent snapshot for the whole scan.
+	 * A parallel scan requires additional infrastructure to work with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1248,24 +1256,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, registration is
+			 * not allowed, because the snapshot is replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* point at the active snapshot, since pushing may copy it */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1279,6 +1304,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1293,6 +1320,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1728,6 +1762,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1800,7 +1836,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 0cb27af1310..c9c53044748 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -464,7 +464,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..7b09ad878b7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -322,18 +322,22 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_ParallelWorkers && indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -481,6 +485,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	else
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
+	InvalidateCatalogSnapshot();
+	Assert(indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique ||
+		  !indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -536,7 +543,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 {
 	BTWriteState wstate;
 
@@ -558,18 +565,21 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
 	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
+	InvalidateCatalogSnapshot();
 
 	/* reserve the metapage */
 	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
+	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1410,6 +1420,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1446,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1509,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1605,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1623,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 6a61e093fa0..06c01cf3360 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -24,6 +24,7 @@
 #include "nodes/execnodes.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -143,6 +144,7 @@ spgbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	result = (IndexBuildResult *) palloc0(sizeof(IndexBuildResult));
 	result->heap_tuples = reltuples;
 	result->index_tuples = buildstate.indtuples;
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	return result;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..f50221930fd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -80,6 +80,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1492,8 +1493,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1511,19 +1512,29 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
 
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1534,12 +1545,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3236,7 +3254,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3299,12 +3318,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the
+ * scan, which causes a new snapshot to be set as active every so often. The
+ * reason for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b10429c3721..a7994652ead 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,23 +1693,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples visible to a single snapshot or to
+	 * periodically reset snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4106,9 +4100,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4123,7 +4114,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 41bd8353430..2a25bb0654a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -63,6 +63,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6927,6 +6928,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6982,6 +6984,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -7039,6 +7046,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..8b3ec6430ad 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -43,6 +43,8 @@
 #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW		(1 << 0)
 #define HEAP_PAGE_PRUNE_FREEZE				(1 << 1)
 
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE		4096
+
 typedef struct BulkInsertStateData *BulkInsertState;
 typedef struct GlobalVisState GlobalVisState;
 typedef struct TupleTableSlot TupleTableSlot;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..71af14d1c31 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -25,6 +25,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -63,6 +64,17 @@ typedef enum ScanOptions
 
 	/* unregister snapshot at scan end? */
 	SO_TEMP_SNAPSHOT = 1 << 9,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and a fresh snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 10,
 }			ScanOptions;
 
 /*
@@ -899,7 +911,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -907,6 +920,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots", NULL);
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot should not be registered to keep xmin propagating. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1739,6 +1761,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * In case of a non-unique index and a non-parallel concurrent build,
+ * SO_RESET_SNAPSHOT is applied to the scan. That causes snapshots to be
+ * changed on the fly, allowing the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index fc82cd67f6c..f4a62ed1ca7 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -11,7 +11,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points hashagg reindex_conc vacuum
+REGRESS = injection_points hashagg reindex_conc vacuum cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace syscache-update-pruned
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..948d1232aa0
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,105 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 20390d6b4bf..ba7bc0cc384 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -38,6 +38,7 @@ tests += {
       'hashagg',
       'reindex_conc',
       'vacuum',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.project_build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.48.1

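For readers following the plan_create_index_workers hunk in the patch above:
the guard there pushes an active snapshot only when the caller has not already
set one, and pops only what it pushed. Below is a minimal stand-alone sketch of
that pattern. It is not PostgreSQL code; every sketch_* name is hypothetical,
and the snapshot stack is reduced to a depth counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the active-snapshot stack; only its depth is modelled. */
int			sketch_stack_depth = 0;

bool
sketch_active_snapshot_set(void)
{
	return sketch_stack_depth > 0;
}

void
sketch_push_active_snapshot(void)
{
	sketch_stack_depth++;
}

void
sketch_pop_active_snapshot(void)
{
	assert(sketch_stack_depth > 0);
	sketch_stack_depth--;
}

/*
 * Push a snapshot only when the caller has not provided one, and pop only
 * what was pushed here.  Returns the stack depth seen while "working".
 */
int
sketch_work_with_snapshot(void)
{
	bool		need_pop_active_snapshot = false;
	int			depth_during_work;

	if (!sketch_active_snapshot_set())
	{
		sketch_push_active_snapshot();
		need_pop_active_snapshot = true;
	}

	/* the real function would evaluate index expressions here */
	depth_during_work = sketch_stack_depth;

	if (need_pop_active_snapshot)
		sketch_pop_active_snapshot();

	return depth_during_work;
}
```

Whether or not the caller already holds a snapshot, the stack depth is the same
before and after the call, which is the property the need_pop_active_snapshot
flag is there to preserve.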


  [text/x-patch] v25-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch (39.7K, 11-v25-0005-Support-snapshot-resets-in-concurrent-builds-of-.patch)
  download | inline diff:
From 411f295c94168eabc8d6f26410d104c9721ebb5b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Thu, 6 Mar 2025 14:54:44 +0100
Subject: [PATCH v25 05/12] Support snapshot resets in concurrent builds of
 unique indexes

Previously, concurrent builds of unique indexes used a fixed snapshot for the entire scan to ensure proper uniqueness checks.

Now snapshots are reset periodically during concurrent unique index builds, while uniqueness is still maintained by:
- ignoring tuples that are dead according to SnapshotSelf during uniqueness checks in tuplesort; this is a fail-fast mechanism, not a guarantee
- adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values; this is the guarantee of correctness

Tuples are tested against SnapshotSelf only when their index key values are equal; otherwise _bt_load works as before.
---
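
Note for readers: the check added to _bt_load can be sketched outside of
PostgreSQL roughly as follows. This is a simplified model under stated
assumptions, not the patch's code: SketchTuple and sketch_has_alive_duplicate
are hypothetical names, keys are plain integers, and the "alive" flag stands in
for the result of fetching the heap tuple with SnapshotSelf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a sorted index tuple: a key plus the answer a
 * SnapshotSelf visibility test would give for the heap tuple it points to.
 */
typedef struct SketchTuple
{
	int			key;
	bool		alive;
} SketchTuple;

/*
 * Return true if any group of equal keys contains more than one alive tuple.
 * "tuples" must be sorted by key.  Liveness is consulted only for tuples that
 * share a key with their predecessor, mirroring the lazy test in the patch.
 */
bool
sketch_has_alive_duplicate(const SketchTuple *tuples, size_t n)
{
	bool		seen_alive = false;
	bool		prev_tested = false;

	assert(tuples != NULL || n == 0);

	for (size_t i = 1; i < n; i++)
	{
		if (tuples[i].key != tuples[i - 1].key)
		{
			/* new key group: forget what was learned about the old one */
			seen_alive = false;
			prev_tested = false;
			continue;
		}

		/* first equal pair in this group: test the previous tuple too */
		if (!prev_tested)
		{
			seen_alive = tuples[i - 1].alive;
			prev_tested = true;
		}

		/* two alive tuples with one key: the build would fail here */
		if (seen_alive && tuples[i].alive)
			return true;

		seen_alive |= tuples[i].alive;
	}

	return false;
}
```

With one key and the liveness pattern "addda" (alive, dead, dead, dead, alive)
the function reports a duplicate even though the two alive tuples are not
adjacent, which is the case the tuplesort-level check alone can miss.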
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 192 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  31 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  71 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 266 insertions(+), 94 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a7e16871af6..42748c01a49 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1236,15 +1236,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index ab0b6946cb0..9a9ee55ff1b 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -149,7 +149,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -375,7 +375,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -790,12 +790,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 53b7ddfff0e..ee94ab509e7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -84,6 +84,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -102,6 +103,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -204,15 +206,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -259,7 +259,7 @@ static double _bt_spools_heapscan(Relation heap, Relation index,
 static void _bt_spooldestroy(BTSpool *btspool);
 static void _bt_spool(BTSpool *btspool, ItemPointer self,
 					  Datum *values, bool *isnull);
-static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots);
+static void _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent);
 static void _bt_build_callback(Relation index, ItemPointer tid, Datum *values,
 							   bool *isnull, bool tupleIsAlive, void *state);
 static BulkWriteBuffer _bt_blnewpage(BTWriteState *wstate, uint32 level);
@@ -304,8 +304,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -322,20 +320,20 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 			 RelationGetRelationName(index));
 
 	reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Finish the build by (1) completing the sort of the spool file, (2)
 	 * inserting the sorted tuples into btree pages and (3) building the upper
 	 * levels.  Finally, it may also be necessary to end use of parallelism.
 	 */
-	_bt_leafbuild(buildstate.spool, buildstate.spool2, !indexInfo->ii_Unique && indexInfo->ii_Concurrent);
+	_bt_leafbuild(buildstate.spool, buildstate.spool2, indexInfo->ii_Concurrent);
 	_bt_spooldestroy(buildstate.spool);
 	if (buildstate.spool2)
 		_bt_spooldestroy(buildstate.spool2);
 	if (buildstate.btleader)
 		_bt_end_parallel(buildstate.btleader);
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
@@ -382,6 +380,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -430,8 +433,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -439,8 +443,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In case of a concurrent build, dead tuples do not need to be put into the
+	 * index since we wait for all snapshots older than the reference snapshot
+	 * during the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -471,7 +479,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -484,7 +492,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		reltuples = _bt_parallel_heapscan(buildstate,
 										  &indexInfo->ii_BrokenHotChain);
 	InvalidateCatalogSnapshot();
-	Assert(!indexInfo->ii_Concurrent || indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!indexInfo->ii_Concurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Set the progress target for the next phase.  Reset the block number
@@ -540,7 +548,7 @@ _bt_spool(BTSpool *btspool, ItemPointer self, Datum *values, bool *isnull)
  * create an entire btree.
  */
 static void
-_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
+_bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool isconcurrent)
 {
 	BTWriteState wstate;
 
@@ -562,7 +570,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 									 PROGRESS_BTREE_PHASE_PERFORMSORT_2);
 		tuplesort_performsort(btspool2->sortstate);
 	}
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
@@ -576,7 +584,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2, bool reset_snapshots)
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
 								 PROGRESS_BTREE_PHASE_LEAF_LOAD);
-	Assert(!reset_snapshots || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!isconcurrent || !TransactionIdIsValid(MyProc->xmin));
 	_bt_load(&wstate, btspool, btspool2);
 }
 
@@ -1155,13 +1163,117 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee the absence of multiple alive
+	 * tuples with the same values in the spool. This may happen if alive
+	 * tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple if not yet tested */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1321,7 +1433,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1418,7 +1530,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1436,21 +1547,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1458,16 +1560,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1537,6 +1639,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1551,7 +1654,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1631,7 +1734,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case of concurrent build snapshots are going to be reset periodically.
 	 * Wait until all workers imported initial snapshot.
 	 */
-	if (reset_snapshot)
+	if (isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, true);
 
 	/* Join heap scan ourselves */
@@ -1642,7 +1745,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	if (!reset_snapshot)
+	if (!isconcurrent)
 		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
@@ -1745,6 +1848,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1848,11 +1952,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1932,6 +2037,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1954,14 +2060,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index b88c396195a..ed5425ac6ec 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -688,7 +688,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -719,7 +719,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -968,7 +968,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -989,7 +989,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1028,7 +1028,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1150,7 +1150,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 41b4fbd1c37..3fff5f45a9d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -68,8 +68,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool forcenonrequired, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -2423,7 +2421,7 @@ _bt_set_startikey(IndexScanDesc scan, BTReadPageState *pstate)
 	lasttup = (IndexTuple) PageGetItem(pstate->page, iid);
 
 	/* Determine the first attribute whose values change on caller's page */
-	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup);
+	firstchangingattnum = _bt_keep_natts_fast(rel, firsttup, lasttup, NULL);
 
 	for (; startikey < so->numberOfKeys; startikey++)
 	{
@@ -3859,7 +3857,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -3977,17 +3975,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as a comparison function during a concurrent
+ * unique index build when _bt_keep_natts_fast is not suitable because the
+ * opclass or collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * *hasnulls (if provided) is set to true if any compared attribute is NULL.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4013,6 +4018,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4032,7 +4039,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4043,7 +4050,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles.  It may also be used as a comparison function during a
+ * concurrent unique index build.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4052,6 +4060,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * *hasnulls (if provided) is set to true if any compared attribute is NULL.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4060,7 +4070,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4077,6 +4088,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			*hasnulls |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 32b7e6311eb..0669407be0c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1533,7 +1533,7 @@ index_concurrently_build(Oid heapRelationId,
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3324,9 +3324,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes a new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a7994652ead..dda1eb0e94c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1693,8 +1693,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 890cdbe1204..1ce2e2ad63c 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -24,6 +24,8 @@
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
+#include "access/relscan.h"
+#include "access/tableam.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "executor/executor.h"
@@ -33,6 +35,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -134,6 +137,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -359,6 +363,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -401,6 +406,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1654,6 +1660,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1663,18 +1670,58 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check; see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the one in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ab467cb8fd..0c9f0e1f3a6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1340,8 +1340,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 613615c78cd..8f5aa0d7146 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1763,9 +1763,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * For a concurrent index build, SO_RESET_SNAPSHOT is applied to the scan, so
+ * snapshots are replaced on the fly to allow the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..eb9bc30e5da 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -429,6 +429,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool	uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.48.1

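Editorial note on the v16-0009 patch above: the _bt_load hunk walks the sorted tuple stream and raises a unique violation only when a group of equal keys contains more than one live tuple. The following is a minimal standalone sketch of that idea, not the patch's code. DemoTuple and demo_find_unique_violation are hypothetical names, and the precomputed "alive" flag is an assumption standing in for the SnapshotSelf heap fetch, which the patch performs lazily and only for tuples that compare equal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple: a key plus a liveness flag. */
typedef struct DemoTuple
{
	int			key;	/* indexed value */
	bool		alive;	/* assumed result of a SnapshotSelf-style check */
} DemoTuple;

/*
 * Scan tuples sorted by key.  Return the position of the first tuple that is
 * the second live member of a group of equal keys, or -1 if every group
 * contains at most one live tuple.
 */
static int
demo_find_unique_violation(const DemoTuple *tuples, size_t ntuples)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < ntuples; i++)
	{
		/* input must be sorted, as it is when read back from a tuplesort */
		assert(i == 0 || tuples[i - 1].key <= tuples[i].key);

		/* a new group starts whenever the key changes */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return (int) i; /* two live tuples share one key */
			seen_alive = true;
		}
	}

	return -1;
}
```

Dead duplicates are skipped silently in this sketch, which corresponds to the "ignore dead tuples in unique check" behaviour that the patch adds as uniqueDeadIgnored.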

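Editorial note on the hasnulls out-parameter added to _bt_keep_natts and _bt_keep_natts_fast above: both functions only OR into *hasnulls, so the sketch below assumes the caller resets the flag before each comparison. It is a simplified illustration over plain integers, not the nbtree code. DemoDatum, demo_keep_natts and demo_tuples_equal are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one attribute value of an index tuple. */
typedef struct DemoDatum
{
	int			value;
	bool		isnull;
} DemoDatum;

/*
 * Return the 1-based number of the first attribute that differs, or
 * natts + 1 when all attributes match.  Two NULLs are treated as matching.
 * If hasnulls is not NULL, *hasnulls is ORed with "a NULL was seen".
 */
static int
demo_keep_natts(const DemoDatum *left, const DemoDatum *right, int natts,
				bool *hasnulls)
{
	int			keepnatts = 1;

	for (int i = 0; i < natts; i++)
	{
		if (hasnulls)
			*hasnulls |= (left[i].isnull || right[i].isnull);

		if (left[i].isnull != right[i].isnull)
			break;
		if (!left[i].isnull && left[i].value != right[i].value)
			break;

		keepnatts++;
	}

	return keepnatts;
}

/* Equality as used for a uniqueness check, honouring NULLS NOT DISTINCT. */
static bool
demo_tuples_equal(const DemoDatum *left, const DemoDatum *right, int natts,
				  bool nulls_not_distinct)
{
	bool		has_nulls = false;	/* reset here: the callee only ORs */
	bool		equal;

	equal = demo_keep_natts(left, right, natts, &has_nulls) > natts;

	/* with the default NULLS DISTINCT, keys containing NULL never collide */
	if (has_nulls && !nulls_not_distinct)
		equal = false;

	return equal;
}
```

Passing NULL for hasnulls, as the existing split-point callers in the patch do, leaves the comparison result unchanged.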

  [text/x-patch] v25-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 12-v25-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From bad826bea3424e91f38b05262157b0ae5743723d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v25 01/12] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca2bde62e82..b10429c3721 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index ca33a854278..0edf54e852d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1f2da072632..f77fe42a2a9 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4c5647ac38a..f6d2a6ede93 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index da5d901ec3c..d0c4386f798 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -803,12 +803,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -843,8 +845,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -856,30 +858,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -902,7 +950,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -922,27 +976,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -962,7 +1012,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -970,6 +1020,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1007,27 +1061,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* For a partial index with a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1035,7 +1097,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* It is required to have at least one indisvalid index during the planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.48.1



  [text/x-patch] v25-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 13-v25-0002-Add-stress-tests-for-concurrent-index-builds.patch)
From 70b8e6147cebcc427b4df419cac7cc7f9056973b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v25 02/12] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..2aad0e8daa8
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- this is because predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.48.1




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-11-09 18:02                                       ` Mihail Nikalayeu <[email protected]>
  2025-11-22 17:08                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 2 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-09 18:02 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Hello!

This is a rebased version.

I also decided to keep only part 3 for now, because we need a common
solution for letting the horizon advance during both INDEX and REPACK
operations [0].
A more detailed description of the solution and benchmark results are
available at [3].

PART 3
STIR-based validation phase for CIC

This part replaces the second phase of CIC with a more efficient one
(with the ability to let the horizon advance as an additional bonus).

The role of the second phase is to find tuples which are not present
in the index built by the first scan, because:
- some of them were too new for the snapshot used during the first phase
- even if we used SnapshotSelf to accept all live tuples, some of them
could be inserted into pages the scan has already visited

The main idea is:
- before starting the first scan, we prepare a special auxiliary
super-lightweight index (it is not really an index or access method,
it just pretends to be one) with the same columns, expressions and
predicates
- that access method (Short Term Index Replacement, STIR) just appends
the TIDs of newly inserted tuples: no WAL, minimal locking, the
simplest append-only structure, no actual indexed data
- it remembers all new TIDs inserted into the table during the first
phase
- once the main (target) index starts receiving updates itself, we can
safely clear the "ready" flag on the STIR index
- if the first-phase scan missed something, it is guaranteed to be
present in the STIR index
- so, instead of having to compare the whole table to the index, we
only need to compare against the TIDs stored in the STIR index
- as a bonus, we can reset snapshots during the comparison without
risking any issues caused by HOT pruning (issue [2] caused the revert
of [1]).
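
To make the comparison step concrete, here is a minimal, self-contained
sketch in plain C. It is only an illustration and not code from the
patch: DemoTid, DemoStir and the demo_* functions are hypothetical
names, TIDs are reduced to (block, offset) pairs, and the target index
is modelled as a flat array of TIDs. It shows that validation only has
to find the TIDs recorded in the append-only log that the target index
does not contain yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Append-only TID log: a toy counterpart of the STIR auxiliary index. */
typedef struct DemoStir
{
	DemoTid    *tids;
	size_t		count;
	size_t		capacity;
} DemoStir;

/* Record the TID of a tuple inserted while the first scan is running. */
bool
demo_stir_append(DemoStir *stir, DemoTid tid)
{
	if (stir->count == stir->capacity)
	{
		size_t		newcap = stir->capacity ? stir->capacity * 2 : 16;
		DemoTid    *grown = realloc(stir->tids, newcap * sizeof(DemoTid));

		if (grown == NULL)
			return false;
		stir->tids = grown;
		stir->capacity = newcap;
	}
	stir->tids[stir->count++] = tid;
	return true;
}

/* Order TIDs by block number, then by offset. */
int
demo_tid_cmp(const void *a, const void *b)
{
	const DemoTid *x = a;
	const DemoTid *y = b;

	if (x->block != y->block)
		return x->block < y->block ? -1 : 1;
	if (x->offset != y->offset)
		return x->offset < y->offset ? -1 : 1;
	return 0;
}

/*
 * Sort both TID sets, then merge them and copy into "missing" every TID
 * that is in the log but not in the target index (duplicates in the log
 * are reported once).  "missing" needs room for stir->count entries.
 * Returns the number of TIDs written.
 */
size_t
demo_find_missing(DemoStir *stir, DemoTid *indexed, size_t nindexed,
				  DemoTid *missing)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		nmissing = 0;

	assert(missing != NULL);
	qsort(stir->tids, stir->count, sizeof(DemoTid), demo_tid_cmp);
	qsort(indexed, nindexed, sizeof(DemoTid), demo_tid_cmp);

	while (i < stir->count)
	{
		int			cmp = (j < nindexed)
			? demo_tid_cmp(&stir->tids[i], &indexed[j]) : -1;

		if (cmp > 0)
			j++;				/* index entry not in the log: ignore */
		else
		{
			if (cmp < 0 &&
				(nmissing == 0 ||
				 demo_tid_cmp(&missing[nmissing - 1], &stir->tids[i]) != 0))
				missing[nmissing++] = stir->tids[i];
			i++;				/* already indexed, duplicate, or recorded */
		}
	}
	return nmissing;
}
```

The cost of this step depends on the number of TIDs logged during the
first phase rather than on the size of the table, which is the point of
the approach described above.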

This approach significantly reduces the time required to build the
index. STIR itself could in theory add some overhead, but I was not
able to measure any. Also, some optimizations are applied to it (see
below). Details of benchmarks are presented below as well.

Commits are:
- Add STIR access method and flags related to auxiliary indexes

This one adds STIR code and some flags to distinguish real and
auxiliary indexes.

- Add Datum storage support to tuplestore

Adds the ability to store Datums in a tuplestore. The following
commits use it to benefit from page prefetching during the validation
phase.

- Use auxiliary indexes for concurrent index operations

The main part is here. It contains all the logic for creating the
auxiliary index, managing its lifecycle, the new validation phase and
so on (including progress reporting, some documentation updates, the
ability to have an unlogged index for logged tables, etc). At this
point it still relies on a single reference snapshot during the
validation phase.

- Track and drop auxiliary indexes in DROP/REINDEX

This commit adds several techniques to avoid extra administrative work
when an index build fails and leaves junk auxiliary indexes behind. It
adds dependency tracking, special handling of REINDEX calls and other
small things to make the administrator's life a little bit easier.

- Optimize auxiliary index handling

Since the STIR index does not contain any actual data, we may skip
preparing the index values during tuple insert. This commit implements
that optimization.

- Refresh snapshot periodically during index validation

Adds logic to the new validation phase to reset the snapshot
periodically. Currently it does so every 4096 pages visited.
A possible caveat here is the requirement to call
InvalidateCatalogSnapshot to make sure xmin propagates.
But AFAIK the same may happen between transaction boundaries in CIC
anyway, and ShareUpdateExclusiveLock on the table is enough.
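
The "every 4096 pages" rule can be pictured with a small counter. This
is a hedged sketch only: SnapshotResetState, page_visited and the
reset_snapshot hook are hypothetical names invented for the example,
and count_reset is a stand-in for whatever the real code does when it
swaps the snapshot (that part is not shown here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interval quoted in the message above. */
#define PAGES_PER_SNAPSHOT_RESET 4096

/* Hypothetical bookkeeping for the validation loop. */
typedef struct SnapshotResetState
{
	uint32_t	pages_since_reset;
	void		(*reset_snapshot) (void *arg);	/* hypothetical hook */
	void	   *arg;
} SnapshotResetState;

/* Stand-in hook for the example: just counts how often it was called. */
static void
count_reset(void *arg)
{
	++*(int *) arg;
}

/*
 * Call once per visited page.  Returns 1 when the snapshot was reset
 * on this call, 0 otherwise.
 */
static int
page_visited(SnapshotResetState *st)
{
	assert(st != NULL && st->reset_snapshot != NULL);

	if (++st->pages_since_reset < PAGES_PER_SNAPSHOT_RESET)
		return 0;

	st->reset_snapshot(st->arg);
	st->pages_since_reset = 0;
	return 1;
}
```

With count_reset as the hook, the first 4095 calls to page_visited
leave the counter untouched, and the 4096th call triggers one reset.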


[0]: https://www.postgresql.org/message-id/flat/202510301734.pj4uds3mqxx4%40alvherre.pgsql#fd20662912580a...
[1]: https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d
[2]: https://www.postgresql.org/message-id/flat/20220524190133.j6ee7zh4f5edt5je%40alap3.anarazel.de#17814...
[3]: https://www.postgresql.org/message-id/[email protected]...

Best regards,
Mikhail.

From 9804077aca1920d08e1007a32387c72f0fdea7ff Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v26 5/8] Use auxiliary indexes for concurrent index operations

Replace the second table full scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; the tuplestore is used in heapam_index_validate_scan_read_stream_next to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
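
To illustrate how sorted TIDs can be turned into "next page to prefetch" requests, here is a simplified sketch. It is not the callback from this patch: it reads from a plain array instead of a tuplestore (so it can peek at the next TID and needs no "leftover" state), and TidGroupState, next_block, INVALID_BLOCK and INVALID_OFFSET are names assumed for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK	UINT32_MAX	/* "no more blocks" marker */
#define INVALID_OFFSET	0			/* terminates the offsets list */

/* Hypothetical cursor over sorted TIDs encoded as (block << 16) | offset. */
typedef struct TidGroupState
{
	const int64_t *tids;
	size_t		ntids;
	size_t		pos;
} TidGroupState;

/*
 * Return the next block number that has TIDs to check and fill
 * offsets[] with the offsets on that block, terminated by
 * INVALID_OFFSET.  offsets[] must have room for max_offsets + 1
 * entries.  If a block has more than max_offsets TIDs, the rest is
 * returned by the following call under the same block number.
 */
static uint32_t
next_block(TidGroupState *st, uint16_t *offsets, size_t max_offsets)
{
	uint32_t	block;
	size_t		n = 0;

	assert(st != NULL && offsets != NULL);

	if (st->pos >= st->ntids)
	{
		offsets[0] = INVALID_OFFSET;
		return INVALID_BLOCK;
	}

	block = (uint32_t) (st->tids[st->pos] >> 16);

	/* Collect offsets while the TIDs stay on the same block. */
	while (st->pos < st->ntids &&
		   (uint32_t) (st->tids[st->pos] >> 16) == block &&
		   n < max_offsets)
		offsets[n++] = (uint16_t) (st->tids[st->pos++] & 0xFFFF);

	offsets[n] = INVALID_OFFSET;
	return block;
}
```

Each call hands back one block together with all offsets to check on it, so every heap page is requested once no matter how many of its TIDs are missing from the main index.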
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 544 ++++++++++++++-------
 src/backend/catalog/index.c                | 314 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1123 insertions(+), 334 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2741c138593..868b025e2ed 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6407,6 +6407,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure use of auxiliary index for new tuples in
+       future transactions.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6447,13 +6459,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6470,8 +6481,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..ba387f28977 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..97f551a55a6 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..28e2a1604c4 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..c85e5332ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1743,242 +1744,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		* Attempt to fetch the next TID from the auxiliary sort. If it's
+		* empty, we set auxindexcursor to NULL.
+		*/
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		* If the auxiliary sort is not yet empty, we now try to synchronize
+		* the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		* the main sort cursor until we've reached or passed the auxiliary TID.
+		*/
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number =  ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close tuple sort as fast as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e509a51c11..6a4b348dd1b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -759,6 +764,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +801,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1411,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1472,6 +1487,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2452,7 +2615,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2512,7 +2676,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3288,12 +3453,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that does not
+ * scan the table, so the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3303,14 +3477,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3495,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of the
+ * two TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to the
+ * reference snapshot are in the index.  Note that the bulkdelete calls must be
+ * done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3520,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,6 +3572,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3413,15 +3597,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3668,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,6 +3700,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3533,6 +3761,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4038,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4287,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4313,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
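The validate_index() header comment in the index.c diff above describes sorting the TIDs of the auxiliary and target indexes and merging the two lists to find TIDs present in the auxiliary index but missing from the target. A minimal standalone sketch of that idea follows. It is not code from the patch: DemoTid, demo_tid_encode(), demo_tid_decode() and demo_merge_missing() are hypothetical names, the TID layout is simplified, and both inputs are assumed to be already sorted and free of duplicates.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a heap TID: (block number, offset number).
 * PostgreSQL's real ItemPointerData is laid out differently.
 */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Pack a TID into an int64 so integer order matches (block, offset) order. */
static int64_t
demo_tid_encode(DemoTid tid)
{
	return ((int64_t) tid.block << 16) | (int64_t) tid.offset;
}

/* Inverse of demo_tid_encode(). */
static DemoTid
demo_tid_decode(int64_t encoded)
{
	DemoTid		tid;

	tid.block = (uint32_t) (encoded >> 16);
	tid.offset = (uint16_t) (encoded & 0xFFFF);
	return tid;
}

/*
 * Walk two ascending arrays of encoded TIDs and copy into missing[] every
 * value that appears in aux[] but not in target[].  Returns the number of
 * values written; missing[] must have room for naux entries.
 */
static size_t
demo_merge_missing(const int64_t *aux, size_t naux,
				   const int64_t *target, size_t ntarget,
				   int64_t *missing)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		nmissing = 0;

	while (i < naux)
	{
		/* Skip target entries that sort before the current aux entry. */
		while (j < ntarget && target[j] < aux[i])
			j++;

		/* No equal entry in target[]: this TID still needs an index entry. */
		if (j >= ntarget || target[j] != aux[i])
			missing[nmissing++] = aux[i];

		i++;
	}
	return nmissing;
}
```

In the patch itself the sorted inputs come from two tuplesorts and the missing TIDs are handed on for heap fetches; the arrays here only illustrate the comparison step.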
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 059e8778ca7..59b77ff7513 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1308,16 +1308,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 974243c5c60..9c34825e97d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,44 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. It is created empty, without any
+	 * actual heap scan, but marked as "ready-for-inserts". The goal of that
+	 * index is to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1727,9 +1783,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1747,24 +1822,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1791,7 +1856,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1816,6 +1881,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3570,6 +3682,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3675,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3728,8 +3848,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3790,6 +3917,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3893,15 +4027,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3952,6 +4089,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3965,12 +4107,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3979,6 +4126,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3997,10 +4145,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4081,13 +4233,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4134,6 +4333,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4141,12 +4375,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4184,7 +4412,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4213,7 +4441,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4303,14 +4531,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4335,6 +4563,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4348,11 +4598,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4372,6 +4622,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..22446b32157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -706,7 +706,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1803,19 +1804,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index dda95e54903..c29f44f2465 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c743fc769cb..aa4fa76358a 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7c52181cbcb..917e4b208f8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2060,14 +2060,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
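
The renumbered phases in the rules.out hunk above can be summarized as a lookup table. This is only a minimal sketch for readers of the thread, not code from the patch: the names `create_index_phase_names`, `phase_name` and `phase_number` are hypothetical, and the access-method-specific suffix that the view appends to phase 3 is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Phase names as listed in the updated pg_stat_progress_create_index view,
 * indexed by the value of s.param10.  Phase 3 is shown without the
 * AM-specific sub-phase suffix that the real view appends.
 */
static const char *const create_index_phase_names[] = {
	"initializing",
	"waiting for writers before build",
	"waiting for writers to use auxiliary index",
	"building index",
	"waiting for writers before validation",
	"index validation: scanning index",
	"index validation: sorting tuples",
	"index validation: merging indexes",
	"waiting for old snapshots",
	"waiting for readers before marking dead",
	"waiting for readers before dropping",
};

#define N_PHASES \
	(sizeof(create_index_phase_names) / sizeof(create_index_phase_names[0]))

/* Hypothetical helper: phase number to name, NULL for unknown values
 * (mirrors the view's ELSE NULL branch). */
const char *
phase_name(int phase)
{
	if (phase < 0 || (size_t) phase >= N_PHASES)
		return NULL;
	return create_index_phase_names[phase];
}

/* Hypothetical helper: phase name to number, -1 if the name is unknown. */
int
phase_number(const char *name)
{
	assert(name != NULL);
	for (size_t i = 0; i < N_PHASES; i++)
	{
		if (strcmp(create_index_phase_names[i], name) == 0)
			return (int) i;
	}
	return -1;
}
```

Monitoring scripts that match on the numeric phase would need the shifted numbers; scripts that match on the text no longer see 'index validation: scanning table'.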


From b5197137540c7301bfe11e61cc02cb6e17d84b8a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v26 1/8] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5712fac3697..974243c5c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 401606f840a..df7e7bce86d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index aa12e9ad2ea..066686483f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4c5647ac38a..f6d2a6ede93 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d950bd93002..ff416f0522c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -808,12 +808,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -848,8 +850,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -861,30 +863,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -907,7 +955,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -927,27 +981,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -967,7 +1017,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -975,6 +1025,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1012,27 +1066,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the
+		 * same set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1040,7 +1102,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
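
The compatibility test performed by IsIndexCompatibleAsArbiter in the execPartition.c hunk above can be illustrated outside the backend. The following is a simplified sketch under stated assumptions, not the patch's code: `IndexShape`, `MAX_KEYS` and `compatible_as_arbiter` are hypothetical stand-ins for the Relation/IndexInfo fields the real function reads, and the comparison of index expressions and predicates (done with list_difference in the patch) is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_KEYS 32				/* illustrative bound, assumption */

/* Hypothetical, flattened view of the fields the patch compares. */
typedef struct IndexShape
{
	bool		is_unique;			/* stands in for ii_Unique */
	bool		has_exclusion_ops;	/* stands in for ii_ExclusionOps != NULL */
	int			nkeyatts;			/* stands in for indnkeyatts */
	int			keyattrs[MAX_KEYS];	/* stands in for indkey.values[] */
} IndexShape;

/*
 * Returns true if 'candidate' could act as an arbiter alongside 'arbiter':
 * same uniqueness, no exclusion operators on either side, and the same key
 * columns in the same order.  Expressions and predicates are NOT compared
 * in this sketch.
 */
bool
compatible_as_arbiter(const IndexShape *arbiter, const IndexShape *candidate)
{
	assert(arbiter != NULL && candidate != NULL);

	if (arbiter->is_unique != candidate->is_unique)
		return false;
	if (arbiter->has_exclusion_ops || candidate->has_exclusion_ops)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keyattrs[i] != candidate->keyattrs[i])
			return false;
	}
	return true;
}
```

The order-sensitive column comparison matches the loop in the patch: a unique index on (a, b) and one on (b, a) are treated as different arbiters in this sketch, as they are there.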


From 0fcccd4785b8015326d24ef8b7205b44704da9b3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v26 3/8] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds by adding:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index deb9a3dc0d1..0b6ffd6ec6e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3121,6 +3121,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3172,6 +3173,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page, it should be associated
+	 * with block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the
+	 * extension lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed; if not, clean up and skip the tuple */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);	/* not held during insert */
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn that the index needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented; this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this point and is no
+	 * longer used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented; this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..8e509a51c11 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 25089fae3e0..89721607f1f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9200a22bd9f..431a2fae4ad 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -77,6 +77,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf9e12fcb9..feb75e0dc50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..84b32319fb3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a357e1d0c0e..c5595e788a4 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0


From 18998fb24440b0d5482d3a69ba54c03ba693c10f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v26 2/8] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..30376c548d4
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0


From b83b8aeb9cc76f3e1335cf7d04a754184da3c9ca Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v26 4/8] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
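Note (not part of the commit): a minimal standalone sketch of the on-tape
layout described above, for readers who want to see the fixed-length vs.
variable-length distinction in isolation. Tape, tape_write, tape_read,
put_value and get_value are hypothetical names used only for illustration;
they are not PostgreSQL APIs, and the in-memory buffer merely stands in for
the BufFile. The trailing length word used for backward scans is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the BufFile in tuplestore. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

/* Returns n on success, 0 if fewer than n bytes remain (treated as EOF). */
static size_t
tape_read(Tape *t, void *dst, size_t n)
{
	if (t->rpos + n > t->wpos)
		return 0;
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
	return n;
}

/*
 * typlen > 0: fixed-length value, written as raw bytes with no header.
 * typlen < 0: variable-length value, written as [unsigned int len][data].
 */
static void
put_value(Tape *t, int typlen, const void *data, unsigned int datalen)
{
	if (typlen < 0)
		tape_write(t, &datalen, sizeof(datalen));
	else
		assert(datalen == (unsigned int) typlen);
	tape_write(t, data, datalen);
}

/* Returns the number of data bytes copied into dst, or 0 at EOF. */
static unsigned int
get_value(Tape *t, int typlen, void *dst, size_t dstsize)
{
	unsigned int len;

	if (typlen > 0)
		len = (unsigned int) typlen;	/* length is a known constant */
	else if (tape_read(t, &len, sizeof(len)) == 0)
		return 0;

	assert(len <= dstsize);
	if (tape_read(t, dst, len) == 0)
		return 0;
	return len;
}
```

With this layout a 6-byte fixed-length value such as a TID occupies exactly
6 bytes on tape, while a 3-byte variable-length value occupies
sizeof(unsigned int) + 3 bytes.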
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or OID of the Datum type to be stored */
+	int16		datumTypeLen;	/* typlen of that Datum type */
+	bool		datumTypeByVal; /* typbyval of that Datum type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple from tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum-related fields to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot(), but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward, bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);	/* value is held in the pointer itself */
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;	/* same width as read by lentup_datum */
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));	/* leading length word */
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);	/* read into the buffer, not over the pointer */
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0


From 93b583dc37207bddaf9153133ced27a498c7e760 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v26 6/8] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ba387f28977..bad0df105d2 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is simply to run
+    <literal>DROP INDEX</literal> on that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 97f551a55a6..025fe37a370 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6a4b348dd1b..fbebc6ed9ab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -775,6 +775,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1180,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1412,7 +1423,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1580,7 +1592,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2616,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2677,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3848,6 +3863,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3904,6 +3920,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4192,7 +4221,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4281,13 +4311,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4313,18 +4360,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index from a relation
+		 * of relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9c34825e97d..ca4dc003d15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3683,6 +3684,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4032,6 +4034,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4039,6 +4042,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4112,12 +4116,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4127,6 +4136,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4148,10 +4158,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4340,7 +4358,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4363,6 +4382,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4581,6 +4603,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4632,6 +4656,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3aac459e483..d936d198e3a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop has to run as the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do this in the loop because drop->objects has only one element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 84b32319fb3..5896c61a918 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index aa4fa76358a..3ed8999d74f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0


From 9d7ecf55e9a3ba951be8cfc2621ae99e6d7c8037 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v26 7/8] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- pass indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fbebc6ed9ab..2044b724ba5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2916,6 +2916,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index df7e7bce86d..ff47951560e 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0


From 400cd183ca955e989b1b9a2e2faf5df39d32e6f8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v26 8/8] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 25 ++++----
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 139 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 28e2a1604c4..604bdda59ff 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c85e5332ba2..12baa8728d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1996,23 +1996,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2023,14 +2026,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2046,6 +2051,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to advance the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2079,6 +2107,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2134,6 +2163,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2143,9 +2186,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 71ef2e5036f..81406d8fc2b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2044b724ba5..2415e1f2f39 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3513,8 +3513,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. To
+ * propagate xmin, we replace that snapshot with a fresh one every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3527,7 +3528,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3548,13 +3549,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3604,8 +3606,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3641,6 +3647,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3675,6 +3684,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3694,19 +3706,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3729,6 +3746,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca4dc003d15..75152d69b86 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1805,32 +1804,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1852,8 +1830,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4401,7 +4379,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4416,13 +4393,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4434,16 +4404,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4456,7 +4418,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 22446b32157..5fa60e8e37b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -702,12 +702,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1808,20 +1807,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index c9e20418275..b4a444a66e6 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c29f44f2465..051ac02ff9c 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0



Attachments:

  [text/plain] v26-0005-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.4K, 2-v26-0005-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
From 9804077aca1920d08e1007a32387c72f0fdea7ff Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v26 5/8] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index into the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
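
The two-pointer step of the merge can be illustrated with a small standalone sketch. It assumes TIDs have already been encoded as int64 values (block number in the high bits, offset in the low 16 bits) and sorted ascending; tid_encode and tids_missing_from_main are hypothetical names used only for this illustration, and plain arrays stand in for the tuplesorts and the tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoding of a TID as one sortable 64-bit integer. */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Collect TIDs that are present in aux[] but absent from main_tids[].
 * Both inputs must be sorted ascending.  out[] must have room for naux
 * entries.  Returns the number of TIDs written to out[].
 */
static size_t
tids_missing_from_main(const int64_t *aux, size_t naux,
					   const int64_t *main_tids, size_t nmain,
					   int64_t *out)
{
	size_t		j = 0;
	size_t		nout = 0;

	assert(out != NULL || naux == 0);
	for (size_t i = 0; i < naux; i++)
	{
		/* Skip duplicates within the auxiliary list. */
		if (i > 0 && aux[i] == aux[i - 1])
			continue;

		/* Advance the main pointer to the first TID >= aux[i]. */
		while (j < nmain && main_tids[j] < aux[i])
			j++;

		/* No match in main: this TID is a candidate for insertion. */
		if (j == nmain || main_tids[j] != aux[i])
			out[nout++] = aux[i];
	}
	return nout;
}
```

Because both lists are sorted, each pointer only moves forward, so the pass is linear in the combined number of TIDs. The TIDs it yields are only candidates: each one still has to be fetched from the heap and checked for visibility before it is inserted into the main index.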
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 544 ++++++++++++++-------
 src/backend/catalog/index.c                | 314 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1123 insertions(+), 334 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2741c138593..868b025e2ed 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6407,6 +6407,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6447,13 +6459,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6470,8 +6481,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+        with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..ba387f28977 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..97f551a55a6 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as "ready for inserts", making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..28e2a1604c4 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..c85e5332ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1743,242 +1744,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (state->tuplesort).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		* Attempt to fetch the next TID from the auxiliary sort. If it's
+		* empty, we set auxindexcursor to NULL.
+		*/
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		* If the auxiliary sort is not yet empty, we now try to synchronize
+		* the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		* the main sort cursor until we've reached or passed the auxiliary TID.
+		*/
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as possible. */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e509a51c11..6a4b348dd1b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -759,6 +764,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +801,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1411,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1472,6 +1487,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2452,7 +2615,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2512,7 +2676,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3288,12 +3453,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan; as a result, we have an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3303,14 +3477,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3495,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3520,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Finally, the steps needed to drop the auxiliary index concurrently are done.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,6 +3572,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3413,15 +3597,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		/*
+		 * The auxiliary index is empty, so there is nothing to merge into
+		 * the target index.  The caller derives limitXmin from the
+		 * reference snapshot it still holds.
+		 */
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3668,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,6 +3700,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3533,6 +3761,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4038,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4287,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4313,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 059e8778ca7..59b77ff7513 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1308,16 +1308,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 974243c5c60..9c34825e97d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, also create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,44 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index.  It is created empty, without any
+	 * actual heap scan, but is marked as "ready-for-inserts".  Its purpose is
+	 * to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1727,9 +1783,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * The target index is now marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start removing it by clearing its
+	 * "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1747,24 +1822,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1791,7 +1856,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1816,6 +1881,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as not ready for inserts */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3570,6 +3682,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3675,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3728,8 +3848,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3790,6 +3917,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3893,15 +4027,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3952,6 +4089,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3965,12 +4107,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3979,6 +4126,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3997,10 +4145,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4081,13 +4233,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast because it starts empty and needs no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4134,6 +4333,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-to-insert", so
+	 * we are free to start dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4141,12 +4375,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4184,7 +4412,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4213,7 +4441,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4303,14 +4531,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4335,6 +4563,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4348,11 +4598,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4372,6 +4622,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..22446b32157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -706,7 +706,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1803,19 +1804,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index dda95e54903..c29f44f2465 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c743fc769cb..aa4fa76358a 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 7c52181cbcb..917e4b208f8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2060,14 +2060,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
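
Editorial note on the v16 patch above: it inserts a second wait phase and renumbers the values advertised via PROGRESS_CREATEIDX_PHASE. The sketch below restates the new numbering as a lookup function, using the phase text from the updated rules.out expected output. The names createidx_phase_name() and createidx_phase_selfcheck() are hypothetical and exist only for illustration; they are not part of the patch. The "building index" entry is simplified, since the real view appends the access method's subphase name to it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: map a renumbered
 * PROGRESS_CREATEIDX_PHASE value to the phase text shown by
 * pg_stat_progress_create_index after the patch is applied.
 * Returns NULL for unknown values, like the view's ELSE NULL branch.
 */
const char *
createidx_phase_name(int phase)
{
	switch (phase)
	{
		case 0:
			return "initializing";
		case 1:
			return "waiting for writers before build";
		case 2:
			return "waiting for writers to use auxiliary index";
		case 3:
			/* simplified: the view also appends the AM subphase name */
			return "building index";
		case 4:
			return "waiting for writers before validation";
		case 5:
			return "index validation: scanning index";
		case 6:
			return "index validation: sorting tuples";
		case 7:
			return "index validation: merging indexes";
		case 8:
			return "waiting for old snapshots";
		case 9:
			return "waiting for readers before marking dead";
		case 10:
			return "waiting for readers before dropping";
		default:
			return NULL;
	}
}

/* Hypothetical self-check covering the entries the patch adds or changes. */
void
createidx_phase_selfcheck(void)
{
	assert(strcmp(createidx_phase_name(2),
				  "waiting for writers to use auxiliary index") == 0);
	assert(strcmp(createidx_phase_name(7),
				  "index validation: merging indexes") == 0);
	assert(strcmp(createidx_phase_name(10),
				  "waiting for readers before dropping") == 0);
	assert(createidx_phase_name(11) == NULL);
}
```

Anything that matches on the old numeric phase values would need the same renumbering; the old value 6 ("index validation: scanning table") has no direct counterpart in the new list.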



  [text/plain] v26-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 3-v26-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From b5197137540c7301bfe11e61cc02cb6e17d84b8a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v26 1/8] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged in single commit. it
 is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5712fac3697..974243c5c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 401606f840a..df7e7bce86d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index aa12e9ad2ea..066686483f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4c5647ac38a..f6d2a6ede93 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d950bd93002..ff416f0522c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -808,12 +808,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -848,8 +850,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -861,30 +863,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -907,7 +955,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. The set of
+		 * indexes used as arbiters must be the same for all concurrent
+		 * transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -927,27 +981,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -967,7 +1017,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -975,6 +1025,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1012,27 +1066,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1040,7 +1102,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0



  [text/plain] v26-0003-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 4-v26-0003-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 0fcccd4785b8015326d24ef8b7205b44704da9b3 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v26 3/8] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR functions as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It will be used in the following commits.
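
As a rough illustration of the idea only (this is not the patch's code), the sketch below models a STIR-like structure as an append-only array of heap TIDs guarded by a skip-inserts flag. All names in it (ToyTid, ToyStir, toy_stir_insert, toy_stir_scan, toy_count_visit) are hypothetical and exist only for this example; the real access method keeps StirTuple entries in buffer pages and lives in stir.c and stir.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

#define TOY_STIR_CAPACITY 8

/* Hypothetical in-memory analogue of a STIR index: TIDs only, no keys. */
typedef struct ToyStir
{
	bool		skip_inserts;	/* once set, new inserts are ignored */
	size_t		ntids;			/* number of TIDs stored so far */
	ToyTid		tids[TOY_STIR_CAPACITY];
} ToyStir;

static void
toy_stir_init(ToyStir *stir)
{
	assert(stir != NULL);
	stir->skip_inserts = false;
	stir->ntids = 0;
}

/* Append a TID; returns true if stored, false if skipped or out of room. */
static bool
toy_stir_insert(ToyStir *stir, ToyTid tid)
{
	if (stir->skip_inserts || stir->ntids == TOY_STIR_CAPACITY)
		return false;
	stir->tids[stir->ntids++] = tid;
	return true;
}

/* Visit every stored TID, as a validation pass would; nothing is deleted. */
static size_t
toy_stir_scan(const ToyStir *stir,
			  void (*visit) (ToyTid tid, void *arg),
			  void *arg)
{
	for (size_t i = 0; i < stir->ntids; i++)
		visit(stir->tids[i], arg);
	return stir->ntids;
}

/* Example visitor: counts the TIDs it is shown. */
static void
toy_count_visit(ToyTid tid, void *arg)
{
	(void) tid;
	(*(size_t *) arg)++;
}
```

In this toy model, setting skip_inserts plays the role that the metapage flag plays in the description above: later inserts are dropped rather than stored, while a scan still sees everything collected before the flag was set.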
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index deb9a3dc0d1..0b6ffd6ec6e 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3121,6 +3121,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3172,6 +3173,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage =  BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, this index probably needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..8e509a51c11 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 25089fae3e0..89721607f1f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9200a22bd9f..431a2fae4ad 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -77,6 +77,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	BlockNumber lastBlkNo;		/* uint16 would truncate block numbers above 65535 */
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5cf9e12fcb9..feb75e0dc50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..84b32319fb3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a357e1d0c0e..c5595e788a4 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
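
Reader's note on the page layout declared in stir.h above: a STIR data page is a flat array of fixed-size StirTuple entries, so StirPageGetTuple is plain 1-based array indexing and StirPageGetFreeSpace is whatever the block has left after maxoff entries. The standalone sketch below mimics that arithmetic with simplified stand-in types. DemoTid, DemoPage, the demo_* functions and the 8192/24/8 sizes are assumptions made for illustration; they are not the PostgreSQL definitions and not code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only; not taken from PostgreSQL headers. */
#define DEMO_BLCKSZ       8192	/* block size */
#define DEMO_HEADER_SIZE  24	/* page header, already max-aligned */
#define DEMO_OPAQUE_SIZE  8	/* four uint16 fields, like StirPageOpaqueData */

/* Hypothetical stand-in for ItemPointerData: three uint16 fields. */
typedef struct DemoTid
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offnum;
} DemoTid;

/* How many fixed-size entries fit between the header and the opaque area. */
#define DEMO_CAPACITY \
	((DEMO_BLCKSZ - DEMO_HEADER_SIZE - DEMO_OPAQUE_SIZE) / sizeof(DemoTid))

typedef struct DemoPage
{
	uint16_t	maxoff;			/* number of entries currently stored */
	DemoTid		tuples[DEMO_CAPACITY];
} DemoPage;

/* 1-based lookup, in the spirit of StirPageGetTuple. */
static DemoTid *
demo_get_tuple(DemoPage *page, unsigned offset)
{
	assert(offset >= 1 && offset <= DEMO_CAPACITY);
	return &page->tuples[offset - 1];
}

/* Same shape as StirPageGetFreeSpace: block minus header, entries, opaque. */
static size_t
demo_free_space(const DemoPage *page)
{
	return DEMO_BLCKSZ - DEMO_HEADER_SIZE
		- page->maxoff * sizeof(DemoTid)
		- DEMO_OPAQUE_SIZE;
}

/* Append one entry; returns 0 when the page is full, 1 otherwise. */
static int
demo_add_item(DemoPage *page, DemoTid tid)
{
	if (demo_free_space(page) < sizeof(DemoTid))
		return 0;
	*demo_get_tuple(page, page->maxoff + 1u) = tid;
	page->maxoff++;
	return 1;
}
```

Under these assumed sizes a page holds 8160 bytes of entries. A caller that gets 0 back from demo_add_item would move on to a fresh page, which is the role played by the "Must extend the file" branch in the insert path above.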



  [text/plain] v26-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.1K, 5-v26-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 18998fb24440b0d5482d3a69ba54c03ba693c10f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v26 2/8] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 223 ++++++++++++++++++++++++++++++++
 2 files changed, 224 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..30376c548d4
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,223 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [text/plain] v26-0004-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 6-v26-0004-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From b83b8aeb9cc76f3e1335cf7d04a754184da3c9ca Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v26 4/8] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This enables using tuplestore for non-tuple data (TIDs) in the next commit.
---
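Reader's note on the layouts listed in the commit message: the difference between the fixed-length and variable-length cases is whether a length header has to travel with each value. The standalone sketch below illustrates only that distinction. DemoStore, demo_put and demo_get are hypothetical names, the uint32 header is an assumption, and the padding and by-value handling mentioned above are deliberately left out; the real format is whatever tuplestore.c in this patch implements.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the tuplestore's backing storage. */
typedef struct DemoStore
{
	unsigned char data[256];
	size_t		used;			/* bytes written so far */
	size_t		readpos;		/* next byte to read */
} DemoStore;

/*
 * Append one value.  typlen > 0 means a fixed-length type: exactly typlen
 * raw bytes are written and len is ignored.  Otherwise a uint32 length
 * header precedes the len payload bytes.  Returns 1 on success, 0 if full.
 */
static int
demo_put(DemoStore *st, int typlen, const void *val, uint32_t len)
{
	uint32_t	payload = (typlen > 0) ? (uint32_t) typlen : len;
	size_t		header = (typlen > 0) ? 0 : sizeof(uint32_t);

	if (st->used + header + payload > sizeof(st->data))
		return 0;
	if (header > 0)
	{
		memcpy(st->data + st->used, &payload, sizeof(payload));
		st->used += sizeof(payload);
	}
	memcpy(st->data + st->used, val, payload);
	st->used += payload;
	return 1;
}

/*
 * Read the next value into out.  Returns its length, or -1 at end of data
 * or when out is too small.  The read position moves only on success.
 */
static int
demo_get(DemoStore *st, int typlen, void *out, size_t outsize)
{
	size_t		pos = st->readpos;
	uint32_t	len;

	if (typlen > 0)
		len = (uint32_t) typlen;	/* length is known from the type */
	else
	{
		if (pos + sizeof(len) > st->used)
			return -1;
		memcpy(&len, st->data + pos, sizeof(len));
		pos += sizeof(len);
	}
	if (len > outsize || pos + len > st->used)
		return -1;
	memcpy(out, st->data + pos, len);
	st->readpos = pos + len;
	return (int) len;
}
```

For a 6-byte TID-like value the fixed-length path costs exactly 6 bytes per entry, while a 5-byte variable-length value costs 9 bytes in this sketch. That saving is presumably why the fixed-length case matters for storing TIDs in the next commit.
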
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
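
A minimal, self-contained sketch of the on-tape layout that the Datum routines in the patch above describe: fixed-length values are stored as raw bytes, while variable-length values get a leading length word and, when backward scans are enabled, a trailing one. This is not code from the patch. `Tape`, `tape_write`, `tape_read`, `put_value` and `get_value` are hypothetical names, and the in-memory buffer only stands in for the BufFile that tuplestore.c uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the real temp file. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->rpos + n <= t->wpos);
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
}

/*
 * Write one value.  typlen > 0 means fixed length: raw bytes only.
 * typlen < 0 means variable length: leading length word, payload, and a
 * trailing length word when backward scans are enabled.
 */
static void
put_value(Tape *t, int typlen, bool backward, const void *data,
		  unsigned int len)
{
	if (typlen > 0)
	{
		assert(len == (unsigned int) typlen);
		tape_write(t, data, len);
		return;
	}
	tape_write(t, &len, sizeof(len));
	tape_write(t, data, len);
	if (backward)
		tape_write(t, &len, sizeof(len));
}

/* Read the next value into dst; returns its length, or 0 at end of tape. */
static unsigned int
get_value(Tape *t, int typlen, bool backward, void *dst, size_t dstsize)
{
	unsigned int len;

	if (t->rpos == t->wpos)
		return 0;
	if (typlen > 0)
		len = (unsigned int) typlen;	/* length is known from the type */
	else
		tape_read(t, &len, sizeof(len));	/* length comes from the tape */
	assert(len <= dstsize);
	tape_read(t, dst, len);
	if (typlen < 0 && backward)
	{
		unsigned int trailer;

		tape_read(t, &trailer, sizeof(trailer));
		assert(trailer == len);
	}
	return len;
}
```

In this model the writer and the reader use the same integer type for the length word, and the length is computed before it is written. Both sides have to agree on that for the stream to be readable.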



  [text/plain] v26-0006-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 7-v26-0006-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 93b583dc37207bddaf9153133ced27a498c7e760 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v26 6/8] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)
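
For orientation, here is a simplified, self-contained model of the lookup this patch adds as get_auxiliary_index(): scan the dependency records that reference the main index and return the first dependent object that is an index attached with an AUTO dependency. It is only a sketch. `DependRow`, `DepType`, `RelKind` and `find_auxiliary_index` are hypothetical names, and a plain array replaces the pg_depend catalog scan.

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Simplified stand-ins for pg_depend.deptype and pg_class.relkind. */
typedef enum { DEP_NORMAL, DEP_AUTO } DepType;
typedef enum { KIND_TABLE, KIND_INDEX } RelKind;

/* One dependency record: objid depends on refobjid. */
typedef struct DependRow
{
	Oid			objid;			/* dependent object */
	RelKind		objkind;		/* kind of the dependent object */
	Oid			refobjid;		/* referenced object */
	DepType		deptype;
} DependRow;

/*
 * Return the OID of the auxiliary index recorded for mainIndexId, or
 * InvalidOid if none exists.  Assumption carried over from the patch: any
 * AUTO dependency on the main index whose dependent is an index is the
 * auxiliary one.
 */
static Oid
find_auxiliary_index(const DependRow *rows, size_t nrows, Oid mainIndexId)
{
	for (size_t i = 0; i < nrows; i++)
	{
		if (rows[i].refobjid == mainIndexId &&
			rows[i].deptype == DEP_AUTO &&
			rows[i].objkind == KIND_INDEX)
			return rows[i].objid;
	}
	return InvalidOid;
}
```

Because the auxiliary index is the dependent side of an AUTO dependency, dropping the main index removes it as well, which is the behavior the commit message describes.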

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ba387f28977..bad0df105d2 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is simply <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 97f551a55a6..025fe37a370 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is simply <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6a4b348dd1b..fbebc6ed9ab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -775,6 +775,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1180,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1412,7 +1423,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1580,7 +1592,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2616,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2677,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3848,6 +3863,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3904,6 +3920,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4192,7 +4221,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4281,13 +4311,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4313,18 +4360,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on this index whose dependent
+		 * object has relkind RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9c34825e97d..ca4dc003d15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3683,6 +3684,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4032,6 +4034,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4039,6 +4042,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4112,12 +4116,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look for a junk auxiliary index of the reindexed index, so it can be dropped */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4127,6 +4136,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4148,10 +4158,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4340,7 +4358,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start process of dropping auxiliary indexes - including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4363,6 +4382,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4581,6 +4603,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4632,6 +4656,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3aac459e483..d936d198e3a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must run in the first transaction. But if
+		 * a junk auxiliary index exists, we want to drop it too (also
+		 * concurrently). In that case, silently drop the auxiliary index
+		 * internally and restore the transaction state. It is fine to do this
+		 * in the loop because drop->objects contains only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 84b32319fb3..5896c61a918 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index aa4fa76358a..3ed8999d74f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [text/plain] v26-0007-Optimize-auxiliary-index-handling.patch (2.4K, 8-v26-0007-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 9d7ecf55e9a3ba951be8cfc2621ae99e6d7c8037 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v26 7/8] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in FormIndexDatum() and bypassing Datum value computation
- passing indexUnchanged=false for auxiliary indexes to avoid the index_unchanged_by_update() check

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)
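
As a rough standalone illustration of the two short-circuits described in the commit message, the sketch below uses simplified stand-in types. IndexInfoSketch and every *_sketch function are hypothetical names invented for this example, not PostgreSQL APIs, and plain long values stand in for Datum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for PostgreSQL's IndexInfo. */
typedef struct IndexInfoSketch
{
	int			num_index_attrs;	/* plays the role of ii_NumIndexAttrs */
	bool		auxiliary;			/* plays the role of ii_Auxiliary */
} IndexInfoSketch;

/* Counts how often the stand-in "expensive" check actually runs. */
static int	unchanged_checks_run = 0;

/* Stand-in for index_unchanged_by_update(); only records that it ran. */
static bool
index_unchanged_by_update_sketch(const IndexInfoSketch *info)
{
	(void) info;
	unchanged_checks_run++;
	return true;
}

/*
 * For an auxiliary index, mark every column NULL and return early.
 * Otherwise copy the (already computed) heap values into the output arrays.
 */
static void
form_index_datum_sketch(const IndexInfoSketch *info,
						const long *heap_values,
						long *values, bool *isnull)
{
	assert(info != NULL);

	if (info->auxiliary)
	{
		for (int i = 0; i < info->num_index_attrs; i++)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		return;
	}

	for (int i = 0; i < info->num_index_attrs; i++)
	{
		values[i] = heap_values[i];
		isnull[i] = false;
	}
}

/*
 * The && chain stops at the first false operand, so the expensive check
 * never runs for an auxiliary index.
 */
static bool
index_unchanged_hint_sketch(bool update, const IndexInfoSketch *info)
{
	return update && !info->auxiliary &&
		index_unchanged_by_update_sketch(info);
}
```

The point of the second function is only the operand order: placing the cheap auxiliary test before the call is what avoids the extra work.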

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fbebc6ed9ab..2044b724ba5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2916,6 +2916,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index df7e7bce86d..ff47951560e 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [text/plain] v26-0008-Refresh-snapshot-periodically-during-index-valid.patch (20.8K, 9-v26-0008-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 400cd183ca955e989b1b9a2e2faf5df39d32e6f8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v26 8/8] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 25 ++++----
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 139 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 28e2a1604c4..604bdda59ff 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c85e5332ba2..12baa8728d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1996,23 +1996,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2023,14 +2026,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2046,6 +2051,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2079,6 +2107,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2134,6 +2163,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2143,9 +2186,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 71ef2e5036f..81406d8fc2b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2044b724ba5..2415e1f2f39 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3513,8 +3513,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3527,7 +3528,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3548,13 +3549,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3604,8 +3606,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3641,6 +3647,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3675,6 +3684,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3694,19 +3706,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3729,6 +3746,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca4dc003d15..75152d69b86 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1805,32 +1804,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1852,8 +1830,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4401,7 +4379,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4416,13 +4393,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4434,16 +4404,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4456,7 +4418,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 22446b32157..5fa60e8e37b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -702,12 +702,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1808,20 +1807,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index c9e20418275..b4a444a66e6 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c29f44f2465..051ac02ff9c 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0
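
For readers following the transam.h hunk above: the new TransactionIdNewer() helper relies on modulo-2^32 XID comparison. Below is a minimal standalone sketch of that idea, not the patch code itself. The names xid_is_valid, xid_follows and xid_newer are made up for illustration, and the sketch assumes normal XIDs only (it does not treat the special bootstrap/frozen XIDs the way PostgreSQL's TransactionIdFollows does).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL definitions (illustration only). */
typedef uint32_t TransactionId;
#define InvalidTransactionId		((TransactionId) 0)
#define FirstNormalTransactionId	((TransactionId) 3)

static bool
xid_is_valid(TransactionId xid)
{
	return xid != InvalidTransactionId;
}

/*
 * True if a is logically later than b, comparing modulo 2^32.
 * Assumption: both arguments are normal XIDs.
 */
static bool
xid_follows(TransactionId a, TransactionId b)
{
	int32_t		diff;

	assert(a >= FirstNormalTransactionId && b >= FirstNormalTransactionId);
	diff = (int32_t) (a - b);
	return diff > 0;
}

/* Return the newer of two XIDs; an invalid XID loses to a valid one. */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}
```

With this comparison an XID just past the wraparound point counts as newer than one just before it, which is what a plain unsigned comparison would get wrong.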




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-11-22 17:08                                         ` Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-22 17:08 UTC (permalink / raw)
  To: Sergey Sargsyan <[email protected]>; +Cc: Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

Reduced memory used by stress-test to avoid OOM in CI.

Best regards,
Mikhail.

From 557b225d64d389489801d081016b7757ef3170fe Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v27 7/8] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index-insert path, detecting auxiliary indexes and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
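
The two shortcuts listed above can be sketched outside the backend roughly as follows. This is only an illustration under simplified assumptions: SketchIndexInfo, sketch_compute_attr, sketch_form_index_datum and sketch_index_unchanged_hint are hypothetical names, and the counters exist only to make the skipped work observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;		/* simplified stand-in */

/* Hypothetical, trimmed-down counterpart of IndexInfo. */
typedef struct SketchIndexInfo
{
	int			num_attrs;
	bool		is_auxiliary;
} SketchIndexInfo;

static int	sketch_compute_calls = 0;	/* counts per-attribute work */
static int	sketch_unchanged_checks = 0;	/* counts "unchanged" checks */

/* Stands in for column fetch / expression evaluation. */
static Datum
sketch_compute_attr(const int *row, int attno)
{
	sketch_compute_calls++;
	return (Datum) row[attno];
}

/* Fill values/isnull; an auxiliary index gets all-NULL without any work. */
static void
sketch_form_index_datum(const SketchIndexInfo *info, const int *row,
						Datum *values, bool *isnull)
{
	int			i;

	assert(info != NULL && values != NULL && isnull != NULL);

	if (info->is_auxiliary)
	{
		for (i = 0; i < info->num_attrs; i++)
		{
			values[i] = (Datum) 0;
			isnull[i] = true;
		}
		return;
	}

	for (i = 0; i < info->num_attrs; i++)
	{
		values[i] = sketch_compute_attr(row, i);
		isnull[i] = false;
	}
}

/* Stands in for the comparatively expensive unchanged-by-update check. */
static bool
sketch_unchanged_by_update(void)
{
	sketch_unchanged_checks++;
	return true;
}

/* Short-circuit: the expensive check never runs for an auxiliary index. */
static bool
sketch_index_unchanged_hint(bool is_update, const SketchIndexInfo *info)
{
	return is_update && !info->is_auxiliary && sketch_unchanged_by_update();
}
```

The ordering of the && operands matters here: the auxiliary test comes before the check call, so the check is never evaluated for auxiliary indexes.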
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fbebc6ed9ab..2044b724ba5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2916,6 +2916,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index df7e7bce86d..ff47951560e 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * In case of auxiliary index always pass false as optimisation.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0


From b908d6ebc220a725fbc7d1e523cc4758d5716af9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v27 8/8] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

Validation now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
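
As a rough model of the refresh cadence described above, the loop below counts pages, swaps the snapshot every N pages, and keeps the newest xmin seen as the limit to wait for afterwards. It is a standalone sketch, not the heapam code: sketch_validate_scan, SketchScanResult, sketch_xmin_source and sketch_linear_xmin are hypothetical, and the "snapshot" is reduced to just its xmin value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;	/* simplified stand-in */

#define SKETCH_REFRESH_EVERY_N_PAGES 4096

typedef struct SketchScanResult
{
	TransactionId limit_xmin;		/* newest snapshot xmin seen */
	int			snapshots_taken;
} SketchScanResult;

/* Hypothetical source of "the xmin a snapshot taken now would have". */
typedef TransactionId (*sketch_xmin_source) (uint32_t pages_done);

static SketchScanResult
sketch_validate_scan(uint32_t total_pages, sketch_xmin_source current_xmin)
{
	SketchScanResult r;
	uint32_t	page;
	uint32_t	page_read_counter = 1;	/* 1 avoids a refresh right at start */

	assert(current_xmin != NULL);

	/* first snapshot */
	r.limit_xmin = current_xmin(0);
	r.snapshots_taken = 1;

	for (page = 0; page < total_pages; page++)
	{
		/* ... check this page's candidate tuples against the snapshot ... */
		page_read_counter++;

		if (page_read_counter % SKETCH_REFRESH_EVERY_N_PAGES == 0)
		{
			/* drop the old snapshot, then take a fresh one */
			TransactionId fresh = current_xmin(page + 1);

			r.snapshots_taken++;
			/* wraparound-aware "newer" check; xmin should not go backwards */
			if ((int32_t) (fresh - r.limit_xmin) > 0)
				r.limit_xmin = fresh;
		}
	}
	return r;
}

/* Example source: the horizon advances by one XID per page scanned. */
static TransactionId
sketch_linear_xmin(uint32_t pages_done)
{
	return (TransactionId) (1000 + pages_done);
}
```

In this model a scan shorter than the refresh interval uses a single snapshot, so it behaves like the old reference-snapshot approach. Longer scans hold each snapshot for at most N pages.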
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 25 ++++----
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 139 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 28e2a1604c4..604bdda59ff 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c85e5332ba2..12baa8728d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1996,23 +1996,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2023,14 +2026,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2046,6 +2051,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so
+	 * often to let the xmin horizon advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2079,6 +2107,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2134,6 +2163,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2143,9 +2186,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 71ef2e5036f..81406d8fc2b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2044b724ba5..2415e1f2f39 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3513,8 +3513,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  In
+ * order to propagate xmin we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3527,7 +3528,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3548,13 +3549,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3604,8 +3606,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3641,6 +3647,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3675,6 +3684,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3694,19 +3706,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3729,6 +3746,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca4dc003d15..75152d69b86 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1805,32 +1804,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1852,8 +1830,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4401,7 +4379,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4416,13 +4393,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4434,16 +4404,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4456,7 +4418,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 22446b32157..5fa60e8e37b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -702,12 +702,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1808,20 +1807,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index c9e20418275..b4a444a66e6 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c29f44f2465..051ac02ff9c 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0
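
Note on the TransactionIdNewer() helper added to transam.h in the patch above: the sketch below restates its logic as a standalone C fragment, so the wraparound behaviour can be checked outside the server. It is only an illustration, not code from the patch. The names xid_t, xid_is_valid, xid_follows and xid_newer are hypothetical stand-ins for TransactionId, TransactionIdIsValid, TransactionIdFollows and TransactionIdNewer. It assumes the usual 32-bit XID space, where 0 is the invalid XID and normal XIDs start at 3.

```c
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's TransactionId (a 32-bit counter). */
typedef uint32_t xid_t;

#define INVALID_XID      ((xid_t) 0)	/* assumed: mirrors InvalidTransactionId */
#define FIRST_NORMAL_XID ((xid_t) 3)	/* assumed: mirrors FirstNormalTransactionId */

/* An XID is valid when it is not the invalid marker. */
static int
xid_is_valid(xid_t x)
{
	return x != INVALID_XID;
}

/*
 * Does a logically follow b?  Special (non-normal) XIDs compare as plain
 * unsigned numbers; normal XIDs compare modulo 2^32, so the signed
 * difference decides which one is ahead on the circle.
 */
static int
xid_follows(xid_t a, xid_t b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs, ignoring an invalid argument. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}
```

The point of the two validity checks is that a caller can fold a sequence of snapshot xmins into one limit value while starting from the invalid XID, which is presumably how validate_index() accumulates the limitXmin it now returns.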


From 595322455a675e4f280cd6b58d5ec31b77934656 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v27 6/8] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ba387f28977..bad0df105d2 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 97f551a55a6..025fe37a370 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6a4b348dd1b..fbebc6ed9ab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -775,6 +775,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set both or neither */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1180,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1412,7 +1423,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1580,7 +1592,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2616,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2677,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3848,6 +3863,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3904,6 +3920,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of that index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4192,7 +4221,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4281,13 +4311,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4313,18 +4360,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency held by a relation with relkind
+		 * RELKIND_INDEX is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9c34825e97d..ca4dc003d15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3683,6 +3684,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4032,6 +4034,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4039,6 +4042,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4112,12 +4116,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4127,6 +4136,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4148,10 +4158,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4340,7 +4358,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4363,6 +4382,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4581,6 +4603,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4632,6 +4656,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 23ebaa3f230..5a6aa9afe32 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop has to run as the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case perform a silent internal deletion of the
+		 * auxiliary index and restore the transaction state. It is fine to do it
+		 * in the loop because there is only a single element in drop->objects.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 84b32319fb3..5896c61a918 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index aa4fa76358a..3ed8999d74f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0


From 311a7020d07d813ce0418cf7f0ff7e05ff30e493 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v27 5/8] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both indexes, and the TIDs of each are put into its own tuplesort
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next reads the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 544 ++++++++++++++-------
 src/backend/catalog/index.c                | 314 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1123 insertions(+), 334 deletions(-)
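
The two-pointer step from the commit message can be illustrated on plain
arrays. This is only a minimal sketch under stated assumptions: TIDs are
taken to be already encoded as int64 values and held in sorted arrays,
and tid_difference with its parameters is a hypothetical name that does
not appear in the patch. The patch itself pulls values out of two
Tuplesortstate objects and writes the result to a tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the merge step: copy every value that occurs
 * in aux[] but not in main_tids[] to out[], and return how many values
 * were copied.  Both inputs must be sorted in ascending order, and out[]
 * must have room for naux entries.
 */
static size_t
tid_difference(const int64_t *main_tids, size_t nmain,
			   const int64_t *aux, size_t naux,
			   int64_t *out)
{
	size_t		i = 0;			/* cursor into main_tids[] */
	size_t		n = 0;			/* number of values written to out[] */

	assert(out != NULL || naux == 0);

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes aux[j]. */
		while (i < nmain && main_tids[i] < aux[j])
			i++;

		/* Main is exhausted, or it jumped past aux[j]: aux[j] is missing. */
		if (i == nmain || main_tids[i] > aux[j])
			out[n++] = aux[j];
	}

	return n;
}
```

Because each cursor only moves forward, the pass is linear in the total
number of TIDs once both inputs are sorted.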

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 436ef0e8bd0..1109ae23cc8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6408,6 +6408,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6448,13 +6460,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6471,8 +6482,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..ba387f28977 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..97f551a55a6 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..28e2a1604c4 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not
+ready for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts those visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..c85e5332ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1743,242 +1744,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate the set difference (relative complement) of the main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in the
+ * main one are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds the
+	 * TIDs that must be compared against those from the main sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it into the target
+			 * index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been seen yet (prev_block_number == InvalidBlockNumber,
+	 *      i.e. the very first tuple), initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If no block number has been recorded yet (the very first tuple),
+		 * initialize it using the tuple we just read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e509a51c11..6a4b348dd1b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -759,6 +764,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +801,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1411,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1472,6 +1487,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into catalogs and needs to
+ * be built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux index are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2452,7 +2615,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2512,7 +2676,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3288,12 +3453,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation that involves
+ * no table scan, so the result is an empty STIR index.  We commit and again
+ * wait for all transactions that could have been modifying the table to
+ * terminate.  From that moment on, all new tuples are also inserted into
+ * the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3303,14 +3477,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot.  Any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3495,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3520,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are run.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% is used for the tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,6 +3572,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3413,15 +3597,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3668,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,6 +3700,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3533,6 +3761,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4038,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4287,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4313,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 95ad29a64b9..68bc24bff62 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1308,16 +1308,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 974243c5c60..9c34825e97d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,44 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but they are still
+	 * marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index.  It is created empty, without any
+	 * actual heap scan, but marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1727,9 +1783,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1747,24 +1822,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1791,7 +1856,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1816,6 +1881,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as not ready */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the dead auxiliary index */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3570,6 +3682,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3675,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3728,8 +3848,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3790,6 +3917,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3893,15 +4027,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3952,6 +4089,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3965,12 +4107,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3979,6 +4126,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3997,10 +4145,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4081,13 +4233,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4134,6 +4333,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4141,12 +4375,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4184,7 +4412,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4213,7 +4441,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4303,14 +4531,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4335,6 +4563,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4348,11 +4598,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4372,6 +4622,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..22446b32157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -706,7 +706,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1803,19 +1804,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index dda95e54903..c29f44f2465 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c743fc769cb..aa4fa76358a 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 372a2188c22..c8dae3283b2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2060,14 +2060,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0


From 5f833d7cfebccf230203ff13d679688a3c46c2cf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v27 4/8] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
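
As a rough illustration of the layout listed above, the following standalone C sketch mimics the two on-tape forms with an in-memory buffer. It is not code from the patch: Tape, DemoTid, put_fixed, put_varlen and get_varlen are hypothetical names, and the sketch assumes that the length word counts only the payload bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the temp file. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		used;			/* bytes written so far */
	size_t		pos;			/* current read position */
} Tape;

/* Hypothetical fixed-length, TID-like value. */
typedef struct DemoTid
{
	unsigned short blk_hi;
	unsigned short blk_lo;
	unsigned short off;
} DemoTid;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->used + n <= sizeof(t->buf));
	memcpy(t->buf + t->used, src, n);
	t->used += n;
}

static void
tape_read(Tape *t, void *dst, size_t n)
{
	assert(t->pos + n <= t->used);
	memcpy(dst, t->buf + t->pos, n);
	t->pos += n;
}

/* Fixed-length value: raw bytes only, no length word. */
static void
put_fixed(Tape *t, const void *val, size_t typlen)
{
	tape_write(t, val, typlen);
}

/* Variable-length value: an "unsigned int" length word, then the payload. */
static void
put_varlen(Tape *t, const void *val, unsigned int len)
{
	tape_write(t, &len, sizeof(len));
	tape_write(t, val, len);
}

/* Read back one variable-length value; returns its payload length. */
static unsigned int
get_varlen(Tape *t, void *dst, size_t dstsize)
{
	unsigned int len;

	tape_read(t, &len, sizeof(len));
	assert(len <= dstsize);
	tape_read(t, dst, len);
	return len;
}
```

A reader of the fixed-length form already knows how many bytes to consume from the type alone, which is why no length word is needed there; only the variable-length form has to store one.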
---
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that Datum type */
+	bool		datumTypeByVal; /* typbyval of that Datum type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of a tuple from tape. Used to provide the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For a Datum of constant length, both "unsigned int" words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (unless it is omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot(), but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+		BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0


From 8fae01e74d9f4d0e4637b06c1691fe194fdd13f9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v27 3/8] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for the following enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds: it temporarily stores TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
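
The behaviour described above (accept inserts, get read once in bulk during validation, never serve searches) can be pictured with a small standalone C sketch. This is only an illustration under assumed names (DemoAuxIndex, demo_insert, demo_bulk_pass, demo_search, demo_count_cb); it does not reproduce the STIR code or its on-disk format.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CAPACITY 64

/* Hypothetical TID: block number plus offset within the block. */
typedef struct DemoTid
{
	unsigned int block;
	unsigned short offset;
} DemoTid;

/* Hypothetical auxiliary index: an append-only list of TIDs, no keys. */
typedef struct DemoAuxIndex
{
	DemoTid		tids[DEMO_CAPACITY];
	size_t		ntids;
} DemoAuxIndex;

/* Visitor invoked once per stored TID during the bulk pass. */
typedef void (*DemoTidCallback) (const DemoTid *tid, void *arg);

/* Record the TID of a newly inserted heap row; returns 0 when full. */
static int
demo_insert(DemoAuxIndex *idx, DemoTid tid)
{
	if (idx->ntids >= DEMO_CAPACITY)
		return 0;
	idx->tids[idx->ntids++] = tid;
	return 1;
}

/* Hand every stored TID to the callback; returns the number visited. */
static size_t
demo_bulk_pass(const DemoAuxIndex *idx, DemoTidCallback cb, void *arg)
{
	size_t		i;

	for (i = 0; i < idx->ntids; i++)
		cb(&idx->tids[i], arg);
	return idx->ntids;
}

/* Searches are outside the intended usage path, so they always fail. */
static int
demo_search(const DemoAuxIndex *idx, DemoTid key)
{
	(void) idx;
	(void) key;
	return -1;
}

/* Example callback: counts the TIDs it is shown. */
static void
demo_count_cb(const DemoTid *tid, void *arg)
{
	(void) tid;
	(*(size_t *) arg)++;
}
```

Keeping the structure append-only and key-free is what makes such a helper cheap to maintain while the main index is being built; the cost is that it cannot answer queries, which matches the restrictions stated above.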
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 65bb0568a86..a5d30b822c3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3126,6 +3126,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3177,6 +3178,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else added a new page to the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can only be built as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..8e509a51c11 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 25089fae3e0..89721607f1f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9200a22bd9f..431a2fae4ad 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -77,6 +77,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1edb18958f7..3d49891af33 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..84b32319fb3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a357e1d0c0e..c5595e788a4 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
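
Editorial note on the STIR page layout in the patch above: StirPageAddItem() and StirPageGetFreeSpace() treat a data page as a flat array of fixed-size StirTuple entries placed between the page header and the opaque area. The sketch below reproduces only that arithmetic in standalone C. It is not code from the patch: every Demo*/demo_* name is hypothetical, and the sizes (8192-byte block, 24-byte aligned page header, 8-byte opaque area, 6-byte tuple) are assumptions chosen to match a default PostgreSQL build.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes; a real build takes these from BLCKSZ and the page structs. */
#define DEMO_BLCKSZ		8192	/* assumed block size */
#define DEMO_HEADER_SZ	24		/* assumed MAXALIGN'd page header size */
#define DEMO_OPAQUE_SZ	8		/* four uint16 fields, like StirPageOpaqueData */

/* Hypothetical stand-in for StirTuple: a 6-byte heap pointer. */
typedef struct DemoTuple
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} DemoTuple;

/* Hypothetical page model: a tuple counter plus the usable data area. */
typedef struct DemoPage
{
	uint16_t	maxoff;			/* number of tuples stored so far */
	unsigned char data[DEMO_BLCKSZ - DEMO_HEADER_SZ - DEMO_OPAQUE_SZ];
} DemoPage;

/* Bytes still available for tuples, mirroring the free-space macro. */
static size_t
demo_page_free_space(const DemoPage *page)
{
	return sizeof(page->data) - (size_t) page->maxoff * sizeof(DemoTuple);
}

/* Append one tuple at the end of the used area; false if it does not fit. */
static bool
demo_page_add_item(DemoPage *page, const DemoTuple *tuple)
{
	size_t		used = (size_t) page->maxoff * sizeof(DemoTuple);

	if (demo_page_free_space(page) < sizeof(DemoTuple))
		return false;

	memcpy(page->data + used, tuple, sizeof(DemoTuple));
	page->maxoff++;

	/* We must not have written past the end of the data area. */
	assert((size_t) page->maxoff * sizeof(DemoTuple) <= sizeof(page->data));
	return true;
}

/* Fill an empty page until a tuple is rejected; returns how many fit. */
static int
demo_fill_page(void)
{
	DemoPage	page;
	DemoTuple	tuple = {0, 1, 1};
	int			count = 0;

	memset(&page, 0, sizeof(page));
	while (demo_page_add_item(&page, &tuple))
		count++;
	return count;
}
```

Under these assumed sizes the data area is 8192 - 24 - 8 = 8160 bytes, so demo_fill_page() returns 8160 / 6 = 1360. Once a page rejects a tuple, stirinsert() in the patch extends the relation and records the new block in the metapage's lastBlkNo.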


From c9f36bf4aca21b541899873c33f9849a693f5c5f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v27 1/8] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged into a single commit.
 It is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5712fac3697..974243c5c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 401606f840a..df7e7bce86d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index aa12e9ad2ea..066686483f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for the additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 00429326c34..bac198de68d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d950bd93002..ff416f0522c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -808,12 +808,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -848,8 +850,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -861,30 +863,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint's index to extract its attributes and predicate.
+		 * Open every index in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -907,7 +955,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * the latter may become indisvalid before the execution phase. This is
+		 * required to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -927,27 +981,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -967,7 +1017,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -975,6 +1025,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1012,27 +1066,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1040,7 +1102,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0
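
For readers following the execPartition.c hunk above, the key-column part of the arbiter compatibility check can be sketched in standalone C. This is a simplified illustration under stated assumptions, not the patch's code: SketchIndex, SKETCH_MAX_KEYS and sketch_compatible_as_arbiter are hypothetical names, and the expression and predicate comparisons that the patch performs with list_difference() are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_KEYS 32		/* hypothetical bound, for the sketch only */

/* Hypothetical, simplified stand-in for the index metadata being compared. */
typedef struct SketchIndex
{
	bool		is_unique;			/* plays the role of ii_Unique */
	bool		has_exclusion_ops;	/* plays the role of ii_ExclusionOps != NULL */
	int			nkeyatts;			/* plays the role of indnkeyatts */
	int			keyattnos[SKETCH_MAX_KEYS]; /* plays the role of indkey.values[] */
} SketchIndex;

/*
 * Return true if the candidate may serve as an arbiter next to the given
 * arbiter index: same uniqueness, no exclusion operators on either side,
 * and identical key columns in the same order.
 */
static bool
sketch_compatible_as_arbiter(const SketchIndex *arbiter,
							 const SketchIndex *candidate)
{
	assert(arbiter->nkeyatts <= SKETCH_MAX_KEYS);
	assert(candidate->nkeyatts <= SKETCH_MAX_KEYS);

	if (arbiter->is_unique != candidate->is_unique)
		return false;
	if (arbiter->has_exclusion_ops || candidate->has_exclusion_ops)
		return false;
	if (arbiter->nkeyatts != candidate->nkeyatts)
		return false;

	for (int i = 0; i < candidate->nkeyatts; i++)
	{
		if (arbiter->keyattnos[i] != candidate->keyattnos[i])
			return false;
	}
	return true;
}
```

In this sketch, an index and its REINDEX CONCURRENTLY replacement describe the same key columns and therefore compare as compatible, which is the case the patch needs both indexes to act as arbiters for.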


From f15922f8cbd87162ad48e96798d05bb1473f3d2e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v27 2/8] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 225 ++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..f160f9d18d7
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,225 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+# uncomment to force non-HOT -> $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



Attachments:

  [text/plain] v27-0007-Optimize-auxiliary-index-handling.patch (2.4K, 2-v27-0007-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 557b225d64d389489801d081016b7757ef3170fe Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v27 7/8] Optimize auxiliary index handling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Skip unnecessary computations for auxiliary indexes by:
- in the index‐insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
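
The short-circuit described in the commit message can be illustrated with a standalone C sketch. It is an assumption-laden simplification, not the patch's code: SketchIndexInfo and sketch_form_index_datum are hypothetical names, plain long values stand in for Datum, and the "normal" path simply copies precomputed values where the real code would evaluate columns and expressions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the relevant IndexInfo fields. */
typedef struct SketchIndexInfo
{
	bool		is_auxiliary;	/* plays the role of ii_Auxiliary */
	int			num_attrs;		/* plays the role of ii_NumIndexAttrs */
} SketchIndexInfo;

/*
 * Fill values[] and isnull[] for one index entry.  For an auxiliary index
 * no per-column value is computed; every slot is marked NULL and the
 * function returns early.
 */
static void
sketch_form_index_datum(const SketchIndexInfo *info,
						const long *heap_values,
						long *values,
						bool *isnull)
{
	assert(info->num_attrs >= 0);

	if (info->is_auxiliary)
	{
		for (int i = 0; i < info->num_attrs; i++)
		{
			values[i] = 0;
			isnull[i] = true;
		}
		return;
	}

	/* Stand-in for the regular path: take the value from the heap row. */
	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = heap_values[i];
		isnull[i] = false;
	}
}
```

The point of the early return is that whatever work the regular path would do per column is skipped entirely for auxiliary indexes.
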
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c | 11 +++++++----
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fbebc6ed9ab..2044b724ba5 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2916,6 +2916,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* An auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index df7e7bce86d..ff47951560e 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -440,11 +440,14 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For an auxiliary index, always pass false as an optimization.
 		 */
-		indexUnchanged = update && index_unchanged_by_update(resultRelInfo,
-															 estate,
-															 indexInfo,
-															 indexRelation);
+		indexUnchanged = update && likely(!indexInfo->ii_Auxiliary) &&
+									index_unchanged_by_update(resultRelInfo,
+															  estate,
+															  indexInfo,
+															  indexRelation);
 
 		satisfiesConstraint =
 			index_insert(indexRelation, /* index relation */
-- 
2.43.0



  [text/plain] v27-0008-Refresh-snapshot-periodically-during-index-valid.patch (20.8K, 3-v27-0008-Refresh-snapshot-periodically-during-index-valid.patch)
From b908d6ebc220a725fbc7d1e523cc4758d5716af9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v27 8/8] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation scan now takes a fresh snapshot every 4096 heap pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
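As a rough illustration of the refresh loop and the limitXmin bookkeeping, here is a standalone sketch. It is not the patch code: every Sketch*/sketch_* name is hypothetical, the snapshot is reduced to its xmin, and the comparison assumes normal (non-special) 32-bit transaction IDs, whereas PostgreSQL's real TransactionIdFollows() also handles permanent XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;

#define SKETCH_INVALID_XID ((SketchXid) 0)
#define SKETCH_REFRESH_EVERY_N_PAGES 4096

/* Wraparound-aware "a is newer than b" for normal transaction IDs. */
static bool
sketch_xid_follows(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two IDs; an invalid ID loses to any valid one. */
SketchXid
sketch_xid_newer(SketchXid a, SketchXid b)
{
	if (a == SKETCH_INVALID_XID)
		return b;
	if (b == SKETCH_INVALID_XID)
		return a;
	return sketch_xid_follows(a, b) ? a : b;
}

/* Hypothetical callback: take a fresh snapshot and return its xmin. */
typedef SketchXid (*SketchTakeSnapshot) (void *arg);

/* Toy snapshot source: each new snapshot has an xmin 10 IDs later. */
SketchXid
sketch_counting_snapshot(void *arg)
{
	SketchXid  *next = (SketchXid *) arg;
	SketchXid	xmin = *next;

	*next += 10;
	return xmin;
}

typedef struct SketchScanResult
{
	SketchXid	limit_xmin;			/* newest xmin of any snapshot used */
	int			snapshots_taken;
} SketchScanResult;

SketchScanResult
sketch_validate_scan(int npages, SketchTakeSnapshot take_snapshot, void *arg)
{
	SketchScanResult result;
	int			page_read_counter = 1;	/* 1 skips a refresh at the start */
	int			page;

	assert(take_snapshot != NULL);

	result.limit_xmin = take_snapshot(arg);
	result.snapshots_taken = 1;

	for (page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... check this page's tuples against the current snapshot ... */

		if (page_read_counter % SKETCH_REFRESH_EVERY_N_PAGES == 0)
		{
			/* Drop the old snapshot, take a new one, remember newest xmin. */
			SketchXid	fresh = take_snapshot(arg);

			result.snapshots_taken++;
			result.limit_xmin = sketch_xid_newer(result.limit_xmin, fresh);
		}
	}
	return result;
}
```

The caller would then wait out every transaction whose snapshot could be older than the returned limit before marking the index valid.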
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 73 +++++++++++++++++++++---
 src/backend/access/spgist/spgvacuum.c    | 12 +++-
 src/backend/catalog/index.c              | 42 ++++++++++----
 src/backend/commands/indexcmds.c         | 50 ++--------------
 src/include/access/tableam.h             | 25 ++++----
 src/include/access/transam.h             | 15 +++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 139 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 28e2a1604c4..604bdda59ff 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if them visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c85e5332ba2..12baa8728d5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1996,23 +1996,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
-	int				num_to_check;
+	int				num_to_check,
+					page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2023,14 +2026,16 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
-	/*
-	 * Encode TIDs as int8 values for the sort, rather than directly sorting
-	 * item pointers.  This can be significantly faster, primarily because TID
-	 * is a pass-by-reference type on all platforms, whereas int8 is
-	 * pass-by-value on most platforms.
-	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2046,6 +2051,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We replace it with a newer snapshot every so often so that
+	 * the xmin horizon can advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; that is why limitXmin is tracked and returned.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2079,6 +2107,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2134,6 +2163,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but guard against it just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2143,9 +2186,21 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 71ef2e5036f..81406d8fc2b 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2044b724ba5..2415e1f2f39 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3513,8 +3513,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3527,7 +3528,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3548,13 +3549,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3604,8 +3606,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3641,6 +3647,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3675,6 +3684,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3694,19 +3706,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3729,6 +3746,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index ca4dc003d15..75152d69b86 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -591,7 +591,6 @@ DefineIndex(Oid tableId,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1805,32 +1804,11 @@ DefineIndex(Oid tableId,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1852,8 +1830,8 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last validation snapshot was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4401,7 +4379,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4416,13 +4393,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4434,16 +4404,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4456,7 +4418,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 22446b32157..5fa60e8e37b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -702,12 +702,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1808,20 +1807,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index c9e20418275..b4a444a66e6 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index c29f44f2465..051ac02ff9c 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0



  [text/plain] v27-0006-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.5K, 4-v27-0006-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
From 595322455a675e4f280cd6b58d5ec31b77934656 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v27 6/8] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
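The dependency tracking can be pictured as a lookup over (dependent, referenced, type) records. The sketch below is a simplified illustration under stated assumptions, not the patch's get_auxiliary_index(): SketchDepend and the other Sketch*/sketch_* names are hypothetical, the catalog is replaced by a plain array, and the relkind check of the real function is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchOid;

#define SKETCH_INVALID_OID ((SketchOid) 0)

typedef enum SketchDepType
{
	SKETCH_DEP_NORMAL,
	SKETCH_DEP_AUTO
} SketchDepType;

/* Simplified dependency record: objid depends on refobjid. */
typedef struct SketchDepend
{
	SketchOid	objid;		/* dependent object, e.g. the auxiliary index */
	SketchOid	refobjid;	/* referenced object, e.g. the main index */
	SketchDepType deptype;
} SketchDepend;

/*
 * Return the first object holding an AUTO dependency on main_index,
 * or SKETCH_INVALID_OID when there is none.
 */
SketchOid
sketch_get_auxiliary_index(const SketchDepend *deps, size_t ndeps,
						   SketchOid main_index)
{
	size_t		i;

	assert(deps != NULL || ndeps == 0);

	for (i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == main_index &&
			deps[i].deptype == SKETCH_DEP_AUTO)
			return deps[i].objid;
	}
	return SKETCH_INVALID_OID;
}
```

Because the dependency is of the AUTO kind, dropping the main index is enough for the dependent auxiliary index to be dropped with it; the lookup is only needed where the auxiliary index must be removed while the main index stays, as in REINDEX.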
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  57 +++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  38 +++++++-
 src/backend/commands/tablecmds.c           |  48 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 367 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ba387f28977..bad0df105d2 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 97f551a55a6..025fe37a370 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -477,11 +477,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 7dded634eb8..b579d26aff2 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -286,7 +286,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6a4b348dd1b..fbebc6ed9ab 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -775,6 +775,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1180,6 +1182,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * For an auxiliary index, record a dependency on the main index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1412,7 +1423,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1580,7 +1592,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2616,7 +2629,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2677,7 +2691,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3848,6 +3863,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3904,6 +3920,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for an auxiliary index of this index; if present, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4192,7 +4221,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4281,13 +4311,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4313,18 +4360,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index c8b11f887e2..1c275ef373f 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -1035,6 +1035,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that any AUTO dependency on the index whose relkind
+		 * is RELKIND_INDEX is the auxiliary index we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 9cc4f06da9f..3aa657c79cb 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9c34825e97d..ca4dc003d15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -245,7 +245,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -943,7 +943,8 @@ DefineIndex(Oid tableId,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3683,6 +3684,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4032,6 +4034,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4039,6 +4042,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4112,12 +4116,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4127,6 +4136,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4148,10 +4158,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4340,7 +4358,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including the junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4363,6 +4382,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready too */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4581,6 +4603,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4632,6 +4656,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 23ebaa3f230..5a6aa9afe32 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1532,6 +1532,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1592,9 +1594,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1646,6 +1659,34 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop has to be the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, silently perform an internal deletion
+		 * of the auxiliary index and restore the transaction state. Doing this
+		 * in the loop is fine: drop->objects has only a single element here.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				PreventInTransactionBlock(true, "DROP INDEX CONCURRENTLY");
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1674,12 +1715,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index b556ba4817b..d7be8715d52 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 0ea7ccf5243..02bcf5e9315 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -180,6 +180,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 84b32319fb3..5896c61a918 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -218,6 +218,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 4904748f5fc..35745bc521c 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index aa4fa76358a..3ed8999d74f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3265,20 +3265,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 7ae8e44019b..6d597790b56 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1340,11 +1340,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [text/plain] v27-0005-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.4K, 5-v27-0005-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 311a7020d07d813ce0418cf7f0ff7e05ff30e493 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v27 5/8] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
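Note (not part of the commit message): the two-pointer step of the merge
described above can be sketched in standalone C. This is only an
illustration under stated assumptions, not the code in this patch: SketchTid,
sketch_tid() and sketch_missing_tids() are hypothetical names, plain arrays
with qsort() stand in for the tuplesorts, and skipping duplicate auxiliary
TIDs is a choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap TID: block number above a 16-bit offset. */
typedef uint64_t SketchTid;

static SketchTid
sketch_tid(uint32_t block, uint16_t offset)
{
	return ((SketchTid) block << 16) | offset;
}

/* qsort() comparator ordering TIDs by block, then by offset. */
static int
sketch_tid_cmp(const void *a, const void *b)
{
	SketchTid	x = *(const SketchTid *) a;
	SketchTid	y = *(const SketchTid *) b;

	return (x > y) - (x < y);
}

/*
 * Sort both TID sets, then walk them with two cursors and collect every TID
 * that is present in the auxiliary set but absent from the main set.
 * "out" must have room for naux entries.  Returns the number of TIDs written.
 */
static size_t
sketch_missing_tids(SketchTid *main_tids, size_t nmain,
					SketchTid *aux_tids, size_t naux,
					SketchTid *out)
{
	size_t		i = 0;			/* cursor into the sorted main set */
	size_t		j;				/* cursor into the sorted auxiliary set */
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	qsort(main_tids, nmain, sizeof(SketchTid), sketch_tid_cmp);
	qsort(aux_tids, naux, sizeof(SketchTid), sketch_tid_cmp);

	for (j = 0; j < naux; j++)
	{
		/* Sketch-only choice: report a repeated auxiliary TID just once. */
		if (j > 0 && aux_tids[j] == aux_tids[j - 1])
			continue;

		/* Advance the main cursor to the first TID >= the auxiliary one. */
		while (i < nmain && main_tids[i] < aux_tids[j])
			i++;

		/* Not found in the main set: this TID still has to be inserted. */
		if (i == nmain || main_tids[i] != aux_tids[j])
			out[nout++] = aux_tids[j];
	}

	return nout;
}
```

In terms of the description above, the returned TIDs correspond to what goes
into the tuplestore; each one would then be fetched from the heap and
inserted into the main index only if the tuple is still alive.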
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  41 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 544 ++++++++++++++-------
 src/backend/catalog/index.c                | 314 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1123 insertions(+), 334 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 436ef0e8bd0..1109ae23cc8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6408,6 +6408,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6448,13 +6460,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the content of the auxiliary index with the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6471,8 +6482,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits in this phase before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..ba387f28977 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..97f551a55a6 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index tracks all tuples
+       inserted while the reindex is in progress.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,14 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
+
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..28e2a1604c4 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
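A standalone sketch (not part of the patch) of the validation idea described in the README text above: given two ascending lists of encoded TIDs, keep those that appear in the auxiliary list but not in the main one. The names `encode_tid` and `sorted_difference` are hypothetical, and plain arrays stand in for the tuplesorts; the block/offset packing is only assumed to resemble what `itemptr_encode` does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for TID encoding: block number in the high bits,
 * offset number in the low 16 bits, so integer order matches TID order.
 */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Copy every element of aux_tids[] that does not occur in main_tids[]
 * into out[].  Both inputs must be sorted ascending, and out[] must have
 * room for naux elements.  Returns the number of elements written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux_tids, size_t naux,
				  int64_t *out)
{
	size_t		m = 0;
	size_t		nout = 0;

	assert(main_tids != NULL || nmain == 0);

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the main cursor until it reaches or passes aux_tids[a]. */
		while (m < nmain && main_tids[m] < aux_tids[a])
			m++;

		/* Main is exhausted or has overshot: the TID is missing from main. */
		if (m == nmain || main_tids[m] > aux_tids[a])
			out[nout++] = aux_tids[a];
	}

	return nout;
}
```

Because both cursors only move forward, the merge is linear in the combined length of the two lists; in the patch, each surviving TID is then looked up in the heap before being inserted into the target index.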
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..c85e5332ba2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1743,242 +1744,405 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * returns number of items added to store
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (the aux argument),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (the main argument).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_off_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool shoud_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_off_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_off_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_off_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &shoud_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If no block number has been recorded yet (first call only),
+		 * initialize it using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_off_offset_number =  ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (shoud_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of memory for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_off_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			state->htups += 1;
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
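A standalone sketch (not part of the patch) of the grouping that the read-stream callback above performs: it walks a sorted list of encoded TIDs and hands back one block at a time, together with that block's offsets terminated by an invalid offset. `BlockGrouper`, `next_block` and `INVALID_OFFSET` are hypothetical names, and the encoding (block in the high bits, offset in the low 16 bits) is an assumption. An array allows peeking at the next element, so the sketch does not need the "leftover offset" bookkeeping the callback uses with a tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_OFFSET 0		/* plays the role of InvalidOffsetNumber */

/* Iteration state over a sorted array of encoded TIDs (hypothetical). */
typedef struct BlockGrouper
{
	const int64_t *tids;		/* sorted ascending */
	size_t		ntids;
	size_t		pos;			/* next element to consume */
} BlockGrouper;

/*
 * Fetch the next block.  On success, stores the block number in *block,
 * fills offsets[] with that block's offsets followed by INVALID_OFFSET,
 * and returns 1.  Returns 0 once the input is exhausted.  The caller must
 * size offsets[] for the largest group plus the terminator.
 */
static int
next_block(BlockGrouper *g, uint32_t *block, uint16_t *offsets)
{
	size_t		i = 0;

	assert(g != NULL && block != NULL && offsets != NULL);

	if (g->pos >= g->ntids)
		return 0;

	*block = (uint32_t) ((uint64_t) g->tids[g->pos] >> 16);

	/* Consume entries while they still belong to the current block. */
	while (g->pos < g->ntids &&
		   (uint32_t) ((uint64_t) g->tids[g->pos] >> 16) == *block)
	{
		offsets[i++] = (uint16_t) (g->tids[g->pos] & 0xFFFF);
		g->pos++;
	}

	offsets[i] = INVALID_OFFSET;	/* terminate the list */
	return 1;
}
```

Returning whole blocks lets the consumer lock each heap page once and check all of its candidate TIDs together, which appears to be why the patch sizes its per-buffer data as `MaxHeapTuplesPerPage + 1` offsets.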
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8e509a51c11..6a4b348dd1b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -714,11 +714,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most
+ *		cases it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -759,6 +764,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +790,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +801,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1397,7 +1411,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1472,6 +1487,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2452,7 +2615,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2512,7 +2676,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3288,12 +3453,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see both indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation with no actual
+ * table scan, so the result is an empty STIR index.  We commit the transaction
+ * and again wait for all transactions that could have been modifying the table
+ * to terminate.  From that point on, all new tuples are also inserted into the
+ * auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3303,14 +3477,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3318,12 +3495,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3341,22 +3520,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int64		main_work_mem_part = (int64) maintenance_work_mem * 8 / 10;
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3389,6 +3572,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3413,15 +3597,55 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If the aux index is empty, the merge can be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		/*
+		 * Nothing to merge.  validate_index() returns void and the caller
+		 * still holds the reference snapshot, from which it derives
+		 * limitXmin itself, so there is no snapshot handling to do here;
+		 * simply return.
+		 */
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3444,27 +3668,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3473,6 +3700,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3533,6 +3761,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3804,6 +4038,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4046,6 +4287,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4071,6 +4313,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
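The validate_index() comment in the index.c changes above describes sorting the TIDs of both indexes and then "merge joining" them to find entries present in the auxiliary index but missing from the target index. A minimal standalone sketch of that idea follows. It is not the patch's code: tid_encode() and tids_missing_from_main() are hypothetical names, and the 64-bit packing is only assumed to resemble the int8 encoding that the comment mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pack (block, offset) into one 64-bit key so that
 * TIDs sort as plain integers (block number major, offset minor).
 */
static int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return (int64_t) (((uint64_t) block << 16) | offset);
}

/*
 * Hypothetical helper: both inputs must be sorted ascending.  Every key
 * found in aux[] but not in main_tids[] is copied to out[] (capacity of
 * at least naux entries).  Returns the number of keys written.
 */
static size_t
tids_missing_from_main(const int64_t *aux, size_t naux,
					   const int64_t *main_tids, size_t nmain,
					   int64_t *out)
{
	size_t		j = 0;
	size_t		nout = 0;

	for (size_t i = 0; i < naux; i++)
	{
		/* the merge only works on sorted input */
		assert(i == 0 || aux[i - 1] <= aux[i]);

		/* skip duplicate auxiliary entries */
		if (i > 0 && aux[i] == aux[i - 1])
			continue;

		/* advance the main cursor up to the current aux key */
		while (j < nmain && main_tids[j] < aux[i])
			j++;

		/* no match in main: this TID still has to be indexed */
		if (j == nmain || main_tids[j] != aux[i])
			out[nout++] = aux[i];
	}

	return nout;
}
```

Each list is walked once, so the merge costs O(naux + nmain) after sorting. In the patch, the missing TIDs would then be fetched from the heap and checked against the reference snapshot before insertion; that part is not modelled here.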
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 95ad29a64b9..68bc24bff62 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1308,16 +1308,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 974243c5c60..9c34825e97d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -181,6 +181,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -231,6 +232,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -242,7 +244,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -552,6 +555,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -561,6 +565,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -582,6 +587,7 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -832,6 +838,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -927,7 +942,8 @@ DefineIndex(Oid tableId,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1592,6 +1608,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1620,11 +1646,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1634,7 +1660,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1673,7 +1699,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1685,14 +1711,44 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but is marked "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that no transactions remain that see the auxiliary
+	 * index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this point we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1727,9 +1783,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1747,24 +1822,14 @@ DefineIndex(Oid tableId,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1791,7 +1856,7 @@ DefineIndex(Oid tableId,
 	 */
 	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1816,6 +1881,53 @@ DefineIndex(Oid tableId,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Wait for all transactions to see the aux index as not ready for inserts */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Wait for all transactions to ignore the now-dead auxiliary index */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3570,6 +3682,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3675,8 +3788,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3728,8 +3848,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3790,6 +3917,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3893,15 +4027,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3952,6 +4089,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3965,12 +4107,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3979,6 +4126,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3997,10 +4145,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4081,13 +4233,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast because it starts empty, with no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4134,6 +4333,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4141,12 +4375,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4184,7 +4412,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4213,7 +4441,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4303,14 +4531,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4335,6 +4563,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4348,11 +4598,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4372,6 +4622,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e97e0943f5b..b556ba4817b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..22446b32157 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -706,7 +706,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1803,19 +1804,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index dda95e54903..c29f44f2465 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 1cde4bd9bcf..9e93a4d9310 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -94,14 +94,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 5473ce9a288..4904748f5fc 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index c743fc769cb..aa4fa76358a 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3197,6 +3198,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3209,8 +3211,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3238,6 +3242,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 4d29fb85293..54b251b96ea 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 372a2188c22..c8dae3283b2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2060,14 +2060,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index eabc9623b20..7ae8e44019b 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1311,10 +1312,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1326,6 +1329,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
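
For readers following the phase renumbering, here is a small standalone C sketch of the new mapping from PROGRESS_CREATEIDX_PHASE values to the phase strings shown in the rules.out hunk above. The names `sketch_phase_names` and `sketch_phase_name()` are hypothetical and are not part of the patch or of PostgreSQL; only the numbers and strings are taken from the diff. Phase 3 is reduced to its "building index" prefix, since the real view appends an AM-specific subphase name.

```c
#include <stddef.h>

/*
 * Hypothetical lookup table (not PostgreSQL code): index = phase number
 * after this patch, value = text reported by pg_stat_progress_create_index.
 */
static const char *const sketch_phase_names[] = {
	"initializing",									/* 0 */
	"waiting for writers before build",				/* 1: WAIT_1 */
	"waiting for writers to use auxiliary index",	/* 2: WAIT_2 */
	"building index",								/* 3: BUILD (prefix only) */
	"waiting for writers before validation",		/* 4: WAIT_3 */
	"index validation: scanning index",				/* 5: VALIDATE_IDXSCAN */
	"index validation: sorting tuples",				/* 6: VALIDATE_SORT */
	"index validation: merging indexes",			/* 7: VALIDATE_IDXMERGE */
	"waiting for old snapshots",					/* 8: WAIT_4 */
	"waiting for readers before marking dead",		/* 9: WAIT_5 */
	"waiting for readers before dropping",			/* 10: WAIT_6 */
};

/* Return the phase text, or NULL for an unknown phase (the view's ELSE NULL). */
static const char *
sketch_phase_name(int phase)
{
	int			nphases = (int) (sizeof(sketch_phase_names) /
								 sizeof(sketch_phase_names[0]));

	if (phase < 0 || phase >= nphases)
		return NULL;
	return sketch_phase_names[phase];
}
```

Monitoring scripts that compare the raw phase number instead of the phase text would need adjusting, because every phase after WAIT_1 moves.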



  [text/plain] v27-0004-Add-Datum-storage-support-to-tuplestore.patch (19.0K, 6-v27-0004-Add-Datum-storage-support-to-tuplestore.patch)
  download | inline diff:
From 5f833d7cfebccf230203ff13d679688a3c46c2cf Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 25 Jan 2025 13:33:21 +0100
Subject: [PATCH v27 4/8] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length datatypes: store raw bytes without a length header
- variable-length datatypes: include a length header and padding
- by-value types: store inline

This support enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
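
Note for reviewers: the storage rules in the commit message can be illustrated with a minimal standalone C sketch. This is not the patch's code. The `sketch_*` functions are hypothetical, and the variable-length branch assumes the length word counts itself, mirroring the convention tuplestore.c documents for tuples. Padding and the optional trailing length word used for backward scans are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append one value to buf and return the number of bytes written.
 *
 * typlen > 0:  fixed-length type, raw bytes only, no length word.
 * otherwise:   variable-length, an "unsigned int" length word (assumed to
 *              count itself) followed by the data bytes.
 */
static size_t
sketch_write_value(unsigned char *buf, const void *data, size_t datalen,
				   int typlen)
{
	if (typlen > 0)
	{
		assert(datalen == (size_t) typlen);
		memcpy(buf, data, datalen);
		return datalen;
	}
	else
	{
		unsigned int total = (unsigned int) (sizeof(unsigned int) + datalen);

		memcpy(buf, &total, sizeof(total));
		memcpy(buf + sizeof(total), data, datalen);
		return total;
	}
}

/*
 * On-tape size of the next stored value: a constant for fixed-length
 * types, otherwise read from the leading length word.
 */
static size_t
sketch_next_len(const unsigned char *buf, int typlen)
{
	unsigned int total;

	if (typlen > 0)
		return (size_t) typlen;
	memcpy(&total, buf, sizeof(total));
	return total;
}
```

With a 6-byte fixed-length value such as a TID, the reader never touches the buffer to learn the length, which is the situation the new lentup callback has to cover.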
 src/backend/utils/sort/tuplestore.c | 302 ++++++++++++++++++++++------
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 263 insertions(+), 72 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index c9aecab8d66..38076f3458e 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -143,12 +147,18 @@ struct Tuplestorestate
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
-	 * the already-read length of the stored tuple.  Create and return a
-	 * palloc'd copy, and decrease state->availMem by the amount of memory
-	 * space consumed.
+	 * the already-known (read or constant) length of the stored tuple.
+	 * Create and return a palloc'd copy, and decrease state->availMem by the
+	 * amount of memory space consumed.
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get the length of the next tuple on tape. Used to provide
+	 * the 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
+ * For Datums of constant length, both "unsigned int" length words are omitted.
+ *
  * writetup is expected to write both length words as well as the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for constant-size Datums);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,36 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +498,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +834,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot but for single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1104,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			/* FALLTHROUGH */
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1059,7 +1136,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1167,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1229,25 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+						 bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1556,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1655,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1665,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1724,98 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length: stores raw bytes without length prefix
+ * - Variable-length: includes length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeLen > 0)
+		return state->datumTypeLen;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+	else
+	{
+		Datum d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+		return DatumGetPointer(d);
+	}
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void* datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Assert(state->datumTypeLen > 0);
+			BufFileWrite(state->myfile, &datum, state->datumTypeLen);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+		{
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+			BufFileWrite(state->myfile, &size, sizeof(size));
+		}
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void*
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = PointerGetDatum(NULL);
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+		return DatumGetPointer(datum);
+	}
+	else
+	{
+		void	   *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward && state->datumTypeLen < 0)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 865ba7b8265..0341c47b851 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											 bool randomAccess,
+											 bool interXact,
+											 int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
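
For readers following the Datum routines in the patch above, here is a minimal standalone sketch of the on-disk layout that the "Routines specialized for Datum case" comment describes: fixed-length values are written as raw bytes, variable-length values get a length prefix, and a trailing length word is added when backward scans are enabled. This is not the patch's code. It uses plain stdio instead of BufFile, the names DatumLayout, write_datum and read_datum are hypothetical, and the 4-byte length word is an assumption chosen to match the `unsigned int` read by lentup_datum.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the datumTypeLen/backward fields of the store. */
typedef struct DatumLayout
{
	int			typlen;		/* > 0: fixed length in bytes; -1: variable length */
	int			backward;	/* nonzero: also write a trailing length word */
} DatumLayout;

/* Append one value to fp. For fixed-length layouts "len" is ignored. */
static void
write_datum(FILE *fp, const DatumLayout *lay, const void *data, uint32_t len)
{
	assert(lay->typlen > 0 || lay->typlen == -1);

	if (lay->typlen > 0)
	{
		/* Fixed-length: raw bytes only, no length prefix. */
		fwrite(data, 1, (size_t) lay->typlen, fp);
		return;
	}

	/* Variable-length: length prefix, payload, optional trailing length. */
	fwrite(&len, sizeof(len), 1, fp);
	fwrite(data, 1, len, fp);
	if (lay->backward)
		fwrite(&len, sizeof(len), 1, fp);
}

/* Read the next value into buf; returns its length, or 0 at end of data. */
static uint32_t
read_datum(FILE *fp, const DatumLayout *lay, void *buf, uint32_t bufsize)
{
	uint32_t	len;

	if (lay->typlen > 0)
		len = (uint32_t) lay->typlen;
	else if (fread(&len, sizeof(len), 1, fp) != 1)
		return 0;

	if (len > bufsize || fread(buf, 1, len, fp) != len)
		return 0;

	if (lay->typlen < 0 && lay->backward)
	{
		uint32_t	trailer;

		/* Consume the trailing length word and sanity-check it. */
		if (fread(&trailer, sizeof(trailer), 1, fp) != 1 || trailer != len)
			return 0;
	}
	return len;
}
```

Writing "abc" and "hello" with a variable-length, backward-capable layout therefore takes 3 + 5 payload bytes plus four 4-byte length words, while an 8-byte fixed-length value takes exactly 8 bytes.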



  [text/plain] v27-0003-Add-STIR-access-method-and-flags-related-to-auxi.patch (37.3K, 7-v27-0003-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 8fae01e74d9f4d0e4637b06c1691fe194fdd13f9 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v27 3/8] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for upcoming enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 581 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 786 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
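
As a reading aid for stir.c below, here is a minimal in-memory sketch of the append-only TID storage that the commit message describes: a metapage that tracks the last data page and a skip-inserts flag, and data pages that are filled from the front until full. It is a single-process model under stated assumptions, not the patch's code. The names Tid, TidPage, TidIndex, TIDS_PER_PAGE and MAX_PAGES are hypothetical, there is no buffer locking or retry loop, and unlike stirinsert (which always returns false) the insert function here reports whether the TID was stored.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TIDS_PER_PAGE 4			/* tiny hypothetical capacity */
#define MAX_PAGES 8				/* slot 0 plays the role of the metapage */

typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

typedef struct TidPage
{
	uint16_t	maxoff;			/* number of TIDs stored on this page */
	Tid			items[TIDS_PER_PAGE];
} TidPage;

typedef struct TidIndex
{
	bool		skip_inserts;	/* models the metapage skipInserts flag */
	uint32_t	last_page;		/* models lastBlkNo; 0 means no data page yet */
	uint32_t	npages;			/* pages in use, including the metapage slot */
	TidPage		pages[MAX_PAGES];
} TidIndex;

static void
tid_index_init(TidIndex *idx)
{
	memset(idx, 0, sizeof(*idx));
	idx->npages = 1;			/* only the metapage exists initially */
}

/* Append to the end of a page; returns false if the page is full. */
static bool
page_add_item(TidPage *page, Tid tid)
{
	if (page->maxoff >= TIDS_PER_PAGE)
		return false;
	page->items[page->maxoff++] = tid;
	return true;
}

/* Returns true if the TID was stored, false if skipped or out of room. */
static bool
tid_index_insert(TidIndex *idx, Tid tid)
{
	if (idx->skip_inserts)
		return false;

	/* Try the last data page first. */
	if (idx->last_page > 0 && page_add_item(&idx->pages[idx->last_page], tid))
		return true;

	/* Otherwise extend by one page and record it as the new last page. */
	if (idx->npages >= MAX_PAGES)
		return false;
	idx->last_page = idx->npages++;
	memset(&idx->pages[idx->last_page], 0, sizeof(TidPage));
	return page_add_item(&idx->pages[idx->last_page], tid);
}

/* Count stored TIDs by walking all data pages, as a validation scan would. */
static uint32_t
tid_index_count(const TidIndex *idx)
{
	uint32_t	total = 0;

	for (uint32_t blk = 1; blk < idx->npages; blk++)
		total += idx->pages[blk].maxoff;
	return total;
}
```

With a capacity of four TIDs per page, inserting six TIDs fills the first data page and leaves two on a second one. Setting skip_inserts afterwards makes later inserts no-ops, which models the state the patch puts an index in when a regular VACUUM touches it.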

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 65bb0568a86..a5d30b822c3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3126,6 +3126,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3177,6 +3178,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 7a2d0ddb689..a156cddff35 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..2e083d952d8
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,581 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	START_CRIT_SECTION();
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes can be built only as auxiliary indexes")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented; this index probably needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not ready at that moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented; this index probably needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5d9db167e59..8e509a51c11 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3411,6 +3411,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 874a8fc89ad..9cc4f06da9f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 25089fae3e0..89721607f1f 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -719,6 +719,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..582db77ddc0 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index e2d9e9be41a..e97e0943f5b 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9200a22bd9f..431a2fae4ad 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -77,6 +77,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index a604a4702c3..3127731f9c6 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 26d15928a15..a5ecf9208ad 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index 4a9624802aa..6227c5658fc 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index f7dcb96b43c..838ad32c932 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 1edb18958f7..3d49891af33 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..84b32319fb3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -155,8 +155,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -216,7 +216,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 6c64db6d456..e0d939d6857 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index a357e1d0c0e..c5595e788a4 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2122,9 +2122,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
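
Note on the page layout in the patch above: the StirPageGetFreeSpace macro in stir.h subtracts the aligned page header, the aligned StirPageOpaqueData, and one fixed-size StirTuple per stored entry from BLCKSZ. A minimal sketch of that arithmetic follows, to show roughly how many heap TIDs one STIR page could hold. Everything prefixed with "Sketch"/"sketch_" is a hypothetical stand-in, not code from the patch, and the constants (8192-byte blocks, 8-byte maximum alignment, a 24-byte page header) are assumed PostgreSQL defaults rather than values stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed build constants (typical PostgreSQL defaults), not from the patch. */
#define SKETCH_BLCKSZ			8192
#define SKETCH_MAXIMUM_ALIGNOF	8
#define SKETCH_PAGE_HEADER_SIZE	24	/* assumed value of SizeOfPageHeaderData */

/* Round len up to the next multiple of the maximum alignment. */
#define SKETCH_MAXALIGN(len) \
	(((size_t) (len) + (SKETCH_MAXIMUM_ALIGNOF - 1)) & \
	 ~((size_t) (SKETCH_MAXIMUM_ALIGNOF - 1)))

/* Stand-in for ItemPointerData: a block number split in two, plus an offset. */
typedef struct SketchItemPointer
{
	uint16_t	bi_hi;
	uint16_t	bi_lo;
	uint16_t	ip_posid;
} SketchItemPointer;

/* Mirrors StirTuple from the patch: only a heap TID is stored. */
typedef struct SketchStirTuple
{
	SketchItemPointer heapPtr;
} SketchStirTuple;

/* Mirrors the field layout of StirPageOpaqueData from the patch. */
typedef struct SketchStirPageOpaque
{
	uint16_t	maxoff;
	uint16_t	flags;
	uint16_t	unused;
	uint16_t	stir_page_id;
} SketchStirPageOpaque;

static_assert(sizeof(SketchStirTuple) == 6, "sketch assumes 6-byte tuples");
static_assert(sizeof(SketchStirPageOpaque) == 8, "sketch assumes 8-byte opaque");

/* Same arithmetic as StirPageGetFreeSpace, with maxoff passed explicitly. */
size_t
sketch_free_space(size_t maxoff)
{
	size_t		avail = SKETCH_BLCKSZ
		- SKETCH_MAXALIGN(SKETCH_PAGE_HEADER_SIZE)
		- SKETCH_MAXALIGN(sizeof(SketchStirPageOpaque));
	size_t		used = maxoff * sizeof(SketchStirTuple);

	/* Guard against underflow if the caller passes an impossible maxoff. */
	return used > avail ? 0 : avail - used;
}

/* Number of tuples that fit on an empty page under the assumptions above. */
size_t
sketch_tuples_per_page(void)
{
	return sketch_free_space(0) / sizeof(SketchStirTuple);
}
```

Under these assumptions an empty page has 8160 usable bytes, so sketch_tuples_per_page() evaluates to 1360 entries per page. A build with a different block size or alignment would give a different figure.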



  [text/plain] v27-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch (23.2K, 8-v27-0001-This-is-https-commitfest.postgresql.org-50-5160-.patch)
  download | inline diff:
From c9f36bf4aca21b541899873c33f9849a693f5c5f Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:09:52 +0100
Subject: [PATCH v27 1/8] This is https://commitfest.postgresql.org/50/5160/
 and https://commitfest.postgresql.org/patch/5438/ merged in a single commit. It
 is required for stability of stress tests.

---
 contrib/amcheck/verify_nbtree.c        |  68 ++++++-------
 src/backend/commands/indexcmds.c       |   4 +-
 src/backend/executor/execIndexing.c    |   3 +
 src/backend/executor/execPartition.c   | 119 +++++++++++++++++++---
 src/backend/executor/nodeModifyTable.c |   2 +
 src/backend/optimizer/util/plancat.c   | 135 ++++++++++++++++++-------
 src/backend/utils/time/snapmgr.c       |   2 +
 7 files changed, 245 insertions(+), 88 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 0949c88983a..2445f001700 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -382,7 +382,6 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 	BTMetaPageData *metad;
 	uint32		previouslevel;
 	BtreeLevel	current;
-	Snapshot	snapshot = SnapshotAny;
 
 	if (!readonly)
 		elog(DEBUG1, "verifying consistency of tree structure for index \"%s\"",
@@ -433,38 +432,35 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->heaptuplespresent = 0;
 
 		/*
-		 * Register our own snapshot in !readonly case, rather than asking
+		 * Register our own snapshot for heapallindexed, rather than asking
 		 * table_index_build_scan() to do this for us later.  This needs to
 		 * happen before index fingerprinting begins, so we can later be
 		 * certain that index fingerprinting should have reached all tuples
 		 * returned by table_index_build_scan().
 		 */
-		if (!state->readonly)
-		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 
-			/*
-			 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
-			 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
-			 * the entries it requires in the index.
-			 *
-			 * We must defend against the possibility that an old xact
-			 * snapshot was returned at higher isolation levels when that
-			 * snapshot is not safe for index scans of the target index.  This
-			 * is possible when the snapshot sees tuples that are before the
-			 * index's indcheckxmin horizon.  Throwing an error here should be
-			 * very rare.  It doesn't seem worth using a secondary snapshot to
-			 * avoid this.
-			 */
-			if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
-				!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
-									   snapshot->xmin))
-				ereport(ERROR,
-						(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
-						 errmsg("index \"%s\" cannot be verified using transaction snapshot",
-								RelationGetRelationName(rel))));
-		}
-	}
+		/*
+		 * GetTransactionSnapshot() always acquires a new MVCC snapshot in
+		 * READ COMMITTED mode.  A new snapshot is guaranteed to have all
+		 * the entries it requires in the index.
+		 *
+		 * We must defend against the possibility that an old xact
+		 * snapshot was returned at higher isolation levels when that
+		 * snapshot is not safe for index scans of the target index.  This
+		 * is possible when the snapshot sees tuples that are before the
+		 * index's indcheckxmin horizon.  Throwing an error here should be
+		 * very rare.  It doesn't seem worth using a secondary snapshot to
+		 * avoid this.
+		 */
+		if (IsolationUsesXactSnapshot() && rel->rd_index->indcheckxmin &&
+			!TransactionIdPrecedes(HeapTupleHeaderGetXmin(rel->rd_indextuple->t_data),
+								   state->snapshot->xmin))
+			ereport(ERROR,
+					(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
+					 errmsg("index \"%s\" cannot be verified using transaction snapshot",
+							RelationGetRelationName(rel))));
+	}
 
 	/*
 	 * We need a snapshot to check the uniqueness of the index. For better
@@ -476,9 +472,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		state->indexinfo = BuildIndexInfo(state->rel);
 		if (state->indexinfo->ii_Unique)
 		{
-			if (snapshot != SnapshotAny)
-				state->snapshot = snapshot;
-			else
+			if (state->snapshot == InvalidSnapshot)
 				state->snapshot = RegisterSnapshot(GetTransactionSnapshot());
 		}
 	}
@@ -555,13 +549,12 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		/*
 		 * Create our own scan for table_index_build_scan(), rather than
 		 * getting it to do so for us.  This is required so that we can
-		 * actually use the MVCC snapshot registered earlier in !readonly
-		 * case.
+		 * actually use the MVCC snapshot registered earlier.
 		 *
 		 * Note that table_index_build_scan() calls heap_endscan() for us.
 		 */
 		scan = table_beginscan_strat(state->heaprel,	/* relation */
-									 snapshot,	/* snapshot */
+									 state->snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
@@ -569,7 +562,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
-		 * behaves in !readonly case.
+		 * behaves.
 		 *
 		 * It's okay that we don't actually use the same lock strength for the
 		 * heap relation as any other ii_Concurrent caller would in !readonly
@@ -578,7 +571,7 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 		 * that needs to be sure that there was no concurrent recycling of
 		 * TIDs.
 		 */
-		indexinfo->ii_Concurrent = !state->readonly;
+		indexinfo->ii_Concurrent = true;
 
 		/*
 		 * Don't wait for uncommitted tuple xact commit/abort when index is a
@@ -602,14 +595,11 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 								 state->heaptuplespresent, RelationGetRelationName(heaprel),
 								 100.0 * bloom_prop_bits_set(state->filter))));
 
-		if (snapshot != SnapshotAny)
-			UnregisterSnapshot(snapshot);
-
 		bloom_free(state->filter);
 	}
 
 	/* Be tidy: */
-	if (snapshot == SnapshotAny && state->snapshot != InvalidSnapshot)
+	if (state->snapshot != InvalidSnapshot)
 		UnregisterSnapshot(state->snapshot);
 	MemoryContextDelete(state->targetcontext);
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 5712fac3697..974243c5c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1789,6 +1789,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4228,7 +4229,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap", NULL);
 	StartTransactionCommand();
 
 	/*
@@ -4307,6 +4308,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 401606f840a..df7e7bce86d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -942,6 +943,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict", NULL);
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index aa12e9ad2ea..066686483f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -490,6 +490,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -701,6 +743,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -711,23 +755,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we also need to account for any additional arbiter indexes.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 00429326c34..bac198de68d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -70,6 +70,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1179,6 +1180,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative", NULL);
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index d950bd93002..ff416f0522c 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -808,12 +808,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -848,8 +850,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -861,30 +863,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint's index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid deadlocks from a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare requirements for other indexes to be used as arbiters together
+					 * with indexOidFromConstraint. Both equivalent indexes must be involved
+					 * in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -907,7 +955,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * they may become indisvalid before the execution phase. This is required
+		 * to keep the set of indexes used as arbiters the same for all
+		 * concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -927,27 +981,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -967,7 +1017,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -975,6 +1025,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -1012,27 +1066,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference, its predicate must be implied
+		 * by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -1040,7 +1102,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 65561cc6bc3..8e1a918f130 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -123,6 +123,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -458,6 +459,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end", NULL);
 	}
 }
 
-- 
2.43.0



  [text/plain] v27-0002-Add-stress-tests-for-concurrent-index-builds.patch (9.3K, 9-v27-0002-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From f15922f8cbd87162ad48e96798d05bb1473f3d2e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v27 2/8] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 225 ++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 316ea0d40b8..7df15435fbb 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..f160f9d18d7
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,225 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+# uncomment to force non-HOT -> $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 18:40                                         ` Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-27 18:40 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Matthias van de Meent <[email protected]>

On Sun, 9 Nov 2025 at 19:02, Mihail Nikalayeu <[email protected]> wrote:
>
> Hello!
>
> This is a rebased version.
>
> Also I decided to keep only part 3 for now, because we need some
> common solution to keep the horizon advance for both INDEX and REPACK
> operations [0].

I'm not sure a complete and common approach is that easy between CIC
and REPACK CONCURRENTLY.

Specifically, indexes don't need to deal with the exact visibility
info of a tuple, and can let VACUUM take care of any false positives
(now-dead tuples), while REPACK does need to deal with all of that
(xmin/xmax/xcid). Considering that REPACK is still going to rely
on primitives provided by logical replication, it would not be much
different from reducing the lifetime of the snapshots used by Logical
Replication's initial sync, and I'd rather not have to wait for that
to get implemented.

The only thing I can think of that might be shareable between the two
is the tooling in heapscan to every so often call into a function that
registers a new snapshot, but I think that's a comparatively minor
change on top of what was implemented for CIC, one that REPACK can
deal with on its own.
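
As a rough illustration of such a heapscan hook, here is a toy sketch. It is not code from the patch set: ToyScan, ToySnapshot, refresh_snapshot_fn and the page-count trigger are all hypothetical, and a real implementation would also have to register and unregister snapshots properly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a snapshot: only the xmin it pins matters here. */
typedef struct ToySnapshot
{
	uint32_t	xmin;
} ToySnapshot;

/* Callback returning a fresh snapshot; supplied by the caller (CIC, REPACK, ...). */
typedef ToySnapshot (*refresh_snapshot_fn) (void *arg);

typedef struct ToyScan
{
	uint32_t	npages;			/* pages in the relation */
	uint32_t	refresh_every;	/* refresh the snapshot after this many pages */
	ToySnapshot snapshot;		/* snapshot currently used by the scan */
	uint32_t	nrefreshes;		/* how many times the snapshot was replaced */
} ToyScan;

static void
toy_scan_run(ToyScan *scan, refresh_snapshot_fn refresh, void *arg)
{
	assert(scan->refresh_every > 0);

	for (uint32_t page = 0; page < scan->npages; page++)
	{
		/* ... process all tuples on this page using scan->snapshot ... */

		/* Swap snapshots only on page boundaries, and not after the last page. */
		if ((page + 1) % scan->refresh_every == 0 && page + 1 < scan->npages)
		{
			scan->snapshot = refresh(arg);
			scan->nrefreshes++;
		}
	}
}

/* Example callback: pretend the oldest running xid advances by 10 per call. */
static ToySnapshot
toy_advance_xmin(void *arg)
{
	uint32_t   *next_xmin = (uint32_t *) arg;
	ToySnapshot snap = {*next_xmin};

	*next_xmin += 10;
	return snap;
}
```

The scan itself only knows when to call the hook; what "a new snapshot" means is left to the callback, which is why each caller could supply its own.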

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 18:59                                           ` Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-27 18:59 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Matthias!

On Thu, Nov 27, 2025 at 7:41 PM Matthias van de Meent
<[email protected]> wrote:
> I'm not sure a complete and common approach is that easy between CIC
> and REPACK CONCURRENTLY.

Yes, you're right, but I hope something like [0] may work.

[0]: https://www.postgresql.org/message-id/CADzfLwXN4NXv8C%2B8GzbMJvRaBkJMs838c92CM-6Js-%3DWpi5aRQ%40mail...





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 20:07                                             ` Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-27 20:07 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Thu, 27 Nov 2025 at 20:00, Mihail Nikalayeu
<[email protected]> wrote:
>
> Hello, Matthias!
>
> On Thu, Nov 27, 2025 at 7:41 PM Matthias van de Meent
> <[email protected]> wrote:
> > I'm not sure a complete and common approach is that easy between CIC
> > and REPACK CONCURRENTLY.
>
> Yes, you're right, but I hope something like [0] may work.

While it might not break, and might not hold back other tables'
visibility horizons, it'll still hold back pruning on the table we're
acting on, and that's likely one which already had bloat issues if
you're running RIC (or REPACK).
Hence the approach with properly taking a new snapshot every so often
in CIC/RIC -- that way pruning is allowed up to a relatively recent
point in every table, including the one we're acting on; potentially
saving us from a vicious cycle where RIC causes table bloat in the
table it's working on due to long-held snapshots and a high-churn
workload in that table.
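
The effect on the horizon can be shown with a toy model. It is a sketch under simplifying assumptions, not PostgreSQL code: one xid is consumed per tick, the scan's snapshot is the only thing holding the horizon back, and toy_horizon_lag is a hypothetical name.

```c
#include <stdint.h>

/*
 * Toy model: the concurrent workload consumes one xid per tick and the
 * scan runs for "ticks" ticks.  Returns how far the xmin pinned by the
 * scan lags behind the newest xid when the scan ends.
 * refresh_every == 0 means one snapshot is held for the whole scan.
 */
static uint32_t
toy_horizon_lag(uint32_t ticks, uint32_t refresh_every)
{
	uint32_t	next_xid = 1000;			/* arbitrary starting point */
	uint32_t	snapshot_xmin = next_xid;	/* xmin pinned by the scan */

	for (uint32_t t = 1; t <= ticks; t++)
	{
		next_xid++;				/* workload consumes an xid */

		if (refresh_every != 0 && t % refresh_every == 0)
			snapshot_xmin = next_xid;	/* fresh snapshot: horizon catches up */
	}

	return next_xid - snapshot_xmin;
}
```

In this model a single long-held snapshot lets the lag grow with the duration of the scan, while periodic refreshes bound it by the refresh interval, which is the property that keeps pruning possible on the table being processed.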


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

PS. When I checked the code you linked to on that thread, I noticed
there is a stale pointer dereference issue in
GetPinnedOldestNonRemovableTransactionId, where it pulls data from a
hash table entry that could've been released by that point.





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
@ 2025-11-28 14:50                                               ` Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-28 14:50 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

On Thu, Nov 27, 2025 at 9:07 PM Matthias van de Meent
<[email protected]> wrote:
> While it might not break, and might not hold back other tables'
> visibility horizons, it'll still hold back pruning on the table we're
> acting on, and that's likely one which already had bloat issues if
> you're running RIC (or REPACK).

Yes, a good point about REPACK, agreed.

BTW, what about using the same reset-snapshot technique for REPACK as well?

I thought it was impossible, but what if we:

* while reading the heap, "remember" our current page position in
shared memory
* preserve all xmin/xmax/cid in the newly created repacked table (we need
that for the MVCC-safe approach anyway)
* in the logical decoding layer, check the TID of the decoded tuple and,
by looking at the "current page", decide what to do with it at the apply
phase:

- if it is in the "not-yet-read pages" - ignore it (we will read it later),
but signal the scan to make sure it resets its snapshot before that page
(reset_before = min(reset_before, tid))
- if it is in the "already-read pages" - remember the apply operation (with
the exact target xmin/xmax and resulting xmin/xmax)

Before switching the table, use the same "limit_xmin" logic to wait for
other transactions, the same way CIC does.

It may involve some tricky locking, and maybe I have missed some cases, but
it feels possible to do it correctly by combining the scan state with
xmin/xmax/tid/etc...
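The per-change decision in the proposal above could be sketched roughly as follows. This is a toy model, not PostgreSQL code: `ScanState`, `on_decoded_change`, and the convention that `current_page` is the next page the scan will read are assumptions made for illustration, and the block number of the TID stands in for the full TID.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScanState:
    """Hypothetical state shared between the heap scan and the decoder."""
    current_page: int = 0                  # pages below this are already read
    reset_before: Optional[int] = None     # scan must reset its snapshot before this page
    pending: list = field(default_factory=list)  # operations to replay at apply time


def on_decoded_change(state, page, op):
    """Decide what to do with a decoded change touching heap page `page`."""
    if page >= state.current_page:
        # Not yet read: the scan will see the tuple itself, but it needs a
        # snapshot taken after this change, so request a reset before the page.
        if state.reset_before is None or page < state.reset_before:
            state.reset_before = page
        return "ignore"
    # Already read: the copied data is stale, so remember the operation.
    state.pending.append((page, op))
    return "remember"
```

The locking between the scan and the decoder, and the xmin/xmax bookkeeping for remembered operations, are deliberately left out of this sketch.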

PS.

> PS. When I checked the code you linked to on that thread, I noticed
> there is a stale pointer dereference issue in
> GetPinnedOldestNonRemovableTransactionId, where it pulls data from a
> hash table entry that could've been released by that point.

Thanks!





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 16:57                                                 ` Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 09:09                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  0 siblings, 2 replies; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-28 16:57 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, 28 Nov 2025 at 15:50, Mihail Nikalayeu
<[email protected]> wrote:
>
> Hello!
>
> On Thu, Nov 27, 2025 at 9:07 PM Matthias van de Meent
> <[email protected]> wrote:
> > While it might not break, and might not hold back other tables'
> > visibility horizons, it'll still hold back pruning on the table we're
> > acting on, and that's likely one which already had bloat issues if
> > you're running RIC (or REPACK).
>
> Yes, a good point about REPACK, agreed.
>
> BTW, what about using the same reset-snapshot technique for REPACK as well?
>
> I thought it was impossible, but what if we:
>
> * while reading the heap, "remember" our current page position in
> shared memory
> * preserve all xmin/xmax/cid in the newly created repacked table (we need
> that for the MVCC-safe approach anyway)
> * in the logical decoding layer, check the TID of the decoded tuple and,
> by looking at the "current page", decide what to do with it at the apply
> phase:
>
> - if it is in the "not-yet-read pages" - ignore it (we will read it later),
> but signal the scan to make sure it resets its snapshot before that page
> (reset_before = min(reset_before, tid))
> - if it is in the "already-read pages" - remember the apply operation (with
> the exact target xmin/xmax and resulting xmin/xmax)

Yes, exactly - keep track of which snapshot was used for which part of
the table, and all updates that add/remove tuples from the scanned
range after that snapshot are considered inserts/deletes, similar to
how it'd work if LR had a filter on `ctid BETWEEN '(0, 0)' AND
'(end-of-snapshot-scan)'` which then gets updated every so often.
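How an update would map onto such a moving `ctid` range can be sketched with a small classifier. This is an illustrative model only: `classify_update` and the `(block, offset)` tuple representation of TIDs are assumptions, and the inclusive upper bound mirrors the `BETWEEN` in the filter above.

```python
def classify_update(old_tid, new_tid, scanned_end):
    """Map an UPDATE onto the scanned range (0, 0) .. scanned_end, inclusive.

    TIDs are (block, offset) tuples, which Python compares lexicographically.
    """
    old_in = old_tid <= scanned_end
    new_in = new_tid <= scanned_end
    if old_in and new_in:
        return "update"   # both versions are inside the already-scanned range
    if old_in:
        return "delete"   # tuple left the range; the scan reads the new version later
    if new_in:
        return "insert"   # tuple entered the range from the unscanned part
    return "ignore"       # entirely unscanned; the scan will read it later
```

Each time the scan takes a new snapshot, `scanned_end` moves forward and later changes are classified against the new boundary.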

I'm a bit worried, though, that LR may lose updates due to commit
order differences between WAL and PGPROC. I don't know how that's
handled in logical decoding, and can't find much literature about it
in the repo either.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 17:58                                                   ` Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:31                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  1 sibling, 2 replies; 64+ messages in thread

From: Hannu Krosing @ 2025-11-28 17:58 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
<[email protected]> wrote:
>
...
> I'm a bit worried, though, that LR may lose updates due to commit
> order differences between WAL and PGPROC. I don't know how that's
> handled in logical decoding, and can't find much literature about it
> in the repo either.

Now the reference to logical decoding made me think that maybe the real
fix for CIC would be to leverage logical decoding for the 2nd pass of
CIC and not worry about in-page visibilities at all.

---
Hannu





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 18:05                                                     ` Hannu Krosing <[email protected]>
  2025-11-28 18:40                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  1 sibling, 2 replies; 64+ messages in thread

From: Hannu Krosing @ 2025-11-28 18:05 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, Nov 28, 2025 at 6:58 PM Hannu Krosing <[email protected]> wrote:
>
> On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> <[email protected]> wrote:
> >
> ...
> > I'm a bit worried, though, that LR may lose updates due to commit
> > order differences between WAL and PGPROC. I don't know how that's
> > handled in logical decoding, and can't find much literature about it
> > in the repo either.
>
> Now the reference to logical decoding made me think that maybe the real
> fix for CIC would be to leverage logical decoding for the 2nd pass of
> CIC and not worry about in-page visibilities at all.

And if we are concerned about possibly having to scan more WAL than we
would have had to scan table data, we can start a
tuple-to-index-collector immediately after starting the CIC.

For extra efficiency gains, the collector itself should have two phases:

1. While the first pass of CIC is collecting the visible tuples for the
index, the logical decoding collector also collects any new tuples
added after the CIC start.
2. When the first-pass collection finishes, it also gets the index entries
collected so far by the logical decoding collector and adds them to
the first set before sorting and creating the index.

3. Once the initial index is created, the CIC just gets whatever else
was collected after step 2 and adds it to the index.
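The three steps above can be sketched as a toy model. The function name and the representation of index entries as `(key, tid)` tuples are assumptions made for illustration; a real implementation would have to deal with decoding, deletes, and visibility, none of which is modeled here.

```python
import bisect


def build_index_with_collector(first_pass, collected_during_scan, collected_after_build):
    """Return sorted (key, tid) index entries following the three steps above."""
    # Steps 1-2: merge the first-pass tuples with what the collector gathered
    # in the meantime, then sort once to bulk-build the initial index.
    entries = set(first_pass) | set(collected_during_scan)
    index = sorted(entries)
    # Step 3: add whatever arrived after the merge, one entry at a time.
    for entry in collected_after_build:
        if entry not in entries:
            entries.add(entry)
            bisect.insort(index, entry)
    return index
```

The point of the split is that everything collected before the sort goes through the cheap bulk build, and only the remainder is inserted incrementally.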

---
Hannu




>
> ---
> Hannu





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 18:40                                                       ` Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-28 18:40 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

On Fri, Nov 28, 2025 at 7:05 PM Hannu Krosing <[email protected]> wrote:
> 1. While the first pass of CIC is collecting the visible tuples for the
> index, the logical decoding collector also collects any new tuples
> added after the CIC start.
> 2. When the first-pass collection finishes, it also gets the index entries
> collected so far by the logical decoding collector and adds them to
> the first set before sorting and creating the index.
>
> 3. Once the initial index is created, the CIC just gets whatever else
> was collected after step 2 and adds it to the index.

It feels very similar to the approach with STIR (earlier in this thread):
instead of doing the second scan, just collect all the newly arriving
TIDs in a short-term-index-replacement access method.

I think the lightweight STIR AM (it contains just TIDs) is a better option
here than logical replication, for several reasons (Matthias already
mentioned some of them).

Anyway, it looks like things/threads have become a little mixed up, so
I'll try to structure it a bit.

For the CIC/RIC approach, resetting the snapshot during the heap scan is
enough to achieve a vacuum-friendly state in phase 1.
For phase 2 (validation) we need one additional thing: something to
collect incoming tuples (the STIR index AM is proposed). In that case we
achieve vacuum-friendliness for both phases plus a single heap scan.

At the same time, STIR may be used just as a way to make CIC faster
(single scan), without any improvements related to VACUUM.
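The role of a TID-only auxiliary structure in validation can be sketched as follows. This is a simplified model and not the patch's actual logic: `validate_with_stir`, the dict-based heap, and the rule of skipping TIDs whose tuples no longer exist are all assumptions made for illustration.

```python
def validate_with_stir(index_tids, stir_tids, heap):
    """Return (tid, key) entries to add to the main index during validation.

    index_tids: TIDs already present in the index after the phase-1 build.
    stir_tids:  TIDs recorded by the auxiliary TID-only structure meanwhile.
    heap:       dict mapping TID -> key for tuples that still exist.
    """
    missing = sorted(set(stir_tids) - set(index_tids))
    # Only TIDs were stored, so keys are fetched from the heap by TID;
    # no second full heap scan is needed.
    return [(tid, heap[tid]) for tid in missing if tid in heap]
```

Because the auxiliary structure holds only TIDs, it stays small, and validation touches just the heap tuples that arrived during the build.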

You may check [0] for links.

Another topic is REPACK CONCURRENTLY, which lives in its own thread [1]. It
is already based on LR.
I was talking about a way to use the same technique (resetting the snapshot
during the scan) for REPACK as well, leveraging the already introduced LR
decoding part.

Mikhail.

[0]: https://www.postgresql.org/message-id/flat/CADzfLwWkYi3r-CD_Bbkg-Mx0qxMBzZZFQTL2ud7yHH2KDb1hdw%40mai...
[1]: https://www.postgresql.org/message-id/flat/202507262156.sb455angijk6%40alvherre.pgsql





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Antonin Houska @ 2025-12-01 10:29 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hannu Krosing <[email protected]> wrote:

> On Fri, Nov 28, 2025 at 6:58 PM Hannu Krosing <[email protected]> wrote:
> >
> > On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> > <[email protected]> wrote:
> > >
> > ...
> > > I'm a bit worried, though, that LR may lose updates due to commit
> > > order differences between WAL and PGPROC. I don't know how that's
> > > handled in logical decoding, and can't find much literature about it
> > > in the repo either.
> >
> > Now the reference to logical decoding made me think that maybe the real
> > fix for CIC would be to leverage logical decoding for the 2nd pass of
> > CIC and not worry about in-page visibilities at all.
> 
> And if we are concerned about having possibly to scan more WAL than we
> would have had to scan the table, we can start a
> tuple-to-index-collector immediately after starting the CIC.
> 
> For extra efficiency gains the collector itself should have two phases
> 
> 1. While the first pass of CIC is collecting the visible tuple for
> index the logical decoding collector also collects any new tuples
> added after the CIC start.
> 2. When the first pass collection finishes, it also gets the indexes
> collected so far by the logical decoding collector and adds them to
> the first set before the sorting and creating the index.
> 
> 3. once the initial index is created, the CIC just gets whatever else
> was collected after 2. and adds these to the index

The core problem here is that the snapshot you need for the first pass
restricts VACUUM on all tables in the database. The same problem exists for
REPACK (CONCURRENTLY) and we haven't resolved it yet.

With logical replication, we cannot really use multiple snapshots as Mihail is
proposing elsewhere in the thread, because the logical decoding system only
generates the snapshot for non-catalog tables once (LR uses that snapshot for
the initial table synchronization). Only snapshots for system catalog tables
are then built as the WAL decoding progresses. It can be worked around by
considering regular table as catalog during the processing, but it currently
introduces quite some overhead:

https://www.postgresql.org/message-id/178741.1743514291%40localhost

Perhaps we could enhance the logical decoding so that it gathers the
information needed to build snapshots (AFAICS it's mostly about the
XLOG_HEAP2_NEW_CID record) not only for catalog tables, but also for
particular non-catalog table(s). However, for these non-catalog tables, the
actual snapshot build should only take place when the snapshot is actually
needed. (For catalog tables, each data change triggers the build of a new
snapshot.)
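
The difference between the two build policies can be illustrated with a toy
model. This is only a sketch under loud assumptions: a "snapshot" is reduced
to the set of committed transaction ids, and ToySnapshotTracker,
record_change and compare_build_counts are hypothetical names that do not
correspond to anything in the logical decoding code.

```python
class ToySnapshotTracker:
    """Toy model: tracks committed xids and counts snapshot builds.

    Hypothetical illustration only; real snapshot building in logical
    decoding is far more involved than a set of transaction ids.
    """

    def __init__(self, eager):
        self.eager = eager        # True models the catalog-table policy
        self.committed = set()
        self.builds = 0
        self.latest = None

    def record_change(self, xid):
        """Called for every decoded change that affects visibility."""
        self.committed.add(xid)
        if self.eager:
            # Catalog-like policy: every change triggers a new snapshot.
            self.latest = self._build()

    def snapshot(self):
        """Return a snapshot, building one only when it is needed."""
        if self.eager and self.latest is not None:
            return self.latest
        return self._build()

    def _build(self):
        self.builds += 1
        return frozenset(self.committed)


def compare_build_counts(n_changes):
    """Feed the same changes to both policies, then ask each for one snapshot."""
    eager = ToySnapshotTracker(eager=True)
    lazy = ToySnapshotTracker(eager=False)
    for xid in range(n_changes):
        eager.record_change(xid)
        lazy.record_change(xid)
    assert eager.snapshot() == lazy.snapshot()
    return eager.builds, lazy.builds
```

In this model both trackers gather the same information, but the lazy one
pays for a build only when a snapshot is requested, which is the property
suggested above for the selected non-catalog tables.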

So in general I agree with what you say elsewhere in the thread: it might
be worth enhancing the logical decoding a bit.

Transient enabling of the decoding, only for specific tables (i.e. not
requiring wal_level=logical), is another problem. I proposed a patch for that,
but I'm not sure it has been reviewed yet:

https://www.postgresql.org/message-id/152010.1751307725%40localhost

(See the 0007 part.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-12-01 10:49 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Hannu Krosing <[email protected]>; Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Antonin!

On Mon, Dec 1, 2025 at 11:29 AM Antonin Houska <[email protected]> wrote:
> With logical replication, we cannot really use multiple snapshots as Mihail is
> proposing elsewhere in the thread, because the logical decoding system only
> generates the snapshot for non-catalog tables once (LR uses that snapshot for
> the initial table synchronization). Only snapshots for system catalog tables
> are then built as the WAL decoding progresses. It can be worked around by
> considering regular table as catalog during the processing, but it currently
> introduces quite some overhead:

My idea related to REPACK is a little bit different. I am not talking
about snapshots generated by LR - just GetLatestSnapshot.

> The core problem here is that the snapshot you need for the first pass
> restricts VACUUM on all tables in the database

We might use it only for a few seconds - it is required only to
*start* the scan (to ensure we will not miss anything in the table).
After that we may throw it away and ask GetLatestSnapshot for a fresh one
for the next N pages. We just need to synchronize the scan position in the
table with logical decoding.

The same is possible for CIC too. In that case we should do the same
and just store all incoming tuples the same way STIR does.
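
A toy model of that scan loop, as a minimal sketch under stated assumptions:
the heap is a list of pages holding tuple ids, a "fresh snapshot" simply sees
whatever is in the pages at the moment they are read, and the side collector
stands in for STIR (or for decoded changes). ToyTable and
build_with_snapshot_reset are hypothetical names, not PostgreSQL APIs, and
updates, deletes and real MVCC visibility are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Heap modelled as a list of pages; each page is a list of tuple ids."""
    pages: list = field(default_factory=list)
    collector: list = field(default_factory=list)  # STIR-like side store
    collecting: bool = False

    def insert(self, page_no, tid):
        while len(self.pages) <= page_no:
            self.pages.append([])
        self.pages[page_no].append(tid)
        if self.collecting:
            # Every tuple arriving during the build is also remembered here.
            self.collector.append(tid)


def build_with_snapshot_reset(table, pages_per_snapshot, concurrent_writes):
    """Scan in batches of N pages, taking a fresh 'snapshot' per batch.

    concurrent_writes maps a batch number to a list of (page_no, tid)
    inserts that happen right after that batch has been scanned.
    """
    table.collecting = True      # collection starts before the scan does
    indexed = set()
    snapshots_taken = 0
    pos = 0
    batch = 0
    while pos < len(table.pages):
        snapshots_taken += 1     # the previous snapshot is thrown away here
        end = min(pos + pages_per_snapshot, len(table.pages))
        for page in table.pages[pos:end]:
            indexed.update(page)
        pos = end
        for page_no, tid in concurrent_writes.get(batch, []):
            table.insert(page_no, tid)
        batch += 1
    # Tuples that landed behind the scan position are only in the collector.
    indexed.update(table.collector)
    table.collecting = False
    return indexed, snapshots_taken
```

In this model no snapshot is held for longer than one batch of pages, and a
tuple inserted into an already scanned page is still indexed because the
collector was active from the start of the scan.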

Best regards,
Mikhail.






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Antonin Houska @ 2025-12-02 07:28 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Hannu Krosing <[email protected]>; Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Mihail Nikalayeu <[email protected]> wrote:

> Hello, Antonin!
> 
> On Mon, Dec 1, 2025 at 11:29 AM Antonin Houska <[email protected]> wrote:
> > With logical replication, we cannot really use multiple snapshots as Mihail is
> > proposing elsewhere in the thread, because the logical decoding system only
> > generates the snapshot for non-catalog tables once (LR uses that snapshot for
> > the initial table synchronization). Only snapshots for system catalog tables
> > are then built as the WAL decoding progresses. It can be worked around by
> > considering regular table as catalog during the processing, but it currently
> > introduces quite some overhead:
> 
> My idea related to REPACK is a little bit different. I am not talking
> about snapshots generated by LR - just GetLatestSnapshot.
> 
> > The core problem here is that the snapshot you need for the first pass
> > restricts VACUUM on all tables in the database
> 
> We might use it only for a few seconds - it is required only to
> *start* the scan (to ensure we will not miss anything in the table).
> After that we may throw it away and ask GetLatestSnapshot for a fresh one
> for the next N pages. We just need to synchronize the scan position in the
> table with logical decoding.
> 
> The same is possible for CIC too. In that case we should do the same
> and just store all incoming tuples the same way STIR does.

I suppose you don't want to use logical decoding for CIC, do you? How can then
it be "the same" like in REPACK (CONCURRENTLY)? Or do you propose to rework
REPACK (CONCURRENTLY) from scratch so that it does not use logical decoding
either?

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-12-02 10:27 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Hannu Krosing <[email protected]>; Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Antonin!

On Tue, Dec 2, 2025 at 8:28 AM Antonin Houska <[email protected]> wrote:
> I suppose you don't want to use logical decoding for CIC, do you? How can then
> it be "the same" like in REPACK (CONCURRENTLY)? Or do you propose to rework
> REPACK (CONCURRENTLY) from scratch so that it does not use logical decoding
> either?

My chain of logic here is as follows:
* it looks like we may reuse the snapshot reset technique for REPACK, using
LR plus some tricks
* if that works, why should we use the reset technique + STIR (and not LR too) in CIC?
* mostly because it is not possible to activate LR for some tables
* but there is (your) patch that aims to add the ability to activate
LR for any table
* if that works, it feels natural to replace STIR with LR to keep things
looking and working the same way

While STIR may be more efficient and simpler for CIC, it is still an
additional entity in the PG domain, so LR may be a better solution
from a system design perspective.

But it is only a thought so far, because I have not yet proved that
resetting the snapshot is possible for REPACK (I need to do at least a POC).
What do you think?

Also, I think I'll extract reset-snapshot for CIC into a separate CF
entry, since it may still be used with or without either STIR or LR.

Best regards,
Mikhail.






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Matthias van de Meent @ 2025-12-02 11:12 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Tue, 2 Dec 2025 at 11:27, Mihail Nikalayeu <[email protected]> wrote:
>
> Hello, Antonin!
>
> On Tue, Dec 2, 2025 at 8:28 AM Antonin Houska <[email protected]> wrote:
> > I suppose you don't want to use logical decoding for CIC, do you? How then
> > can it be "the same" as in REPACK (CONCURRENTLY)? Or do you propose to rework
> > REPACK (CONCURRENTLY) from scratch so that it does not use logical decoding
> > either?
>
> My chain of logic here is as follows:
> * it looks like we may reuse the snapshot reset technique for REPACK, using
> LR + some tricks
> * if that works, why should we use the reset technique + STIR (and not LR too) in CIC?

Because it's easier to reason about STIR than it is to reason about
LR, especially when it concerns things like "overhead in heavily
loaded systems".

For CIC, you know that the amount of IO required is proportional only
to the table's data. With LR, that guarantee is gone; concurrent
workloads may bloat the WAL that needs to be scanned to many times the
size of the data you didn't have to scan.

> * mostly because it is not possible to activate LR for some tables
> * but there is (your) patch that aims to add the ability to activate
> LR for any table


> * if it worked - it feels natural to replace STIR by LR to keep things
> looking the same and working the same way
>
> While STIR may be more efficient and simple for CIC - it is still an
> additional entity in the PG domain, so LR may be a better solution
> from a system design perspective.

LR is a very complicated system that depends on WAL and various other
subsystems to work, and it has a significant amount of overhead.
I disagree with any work to make (concurrent) index creation depend on
WAL; it is _not_ the right approach. Don't shoe-horn this into that.

> But it is only a thought so far, because I have not yet proved that
> snapshot reset is possible for REPACK (I need to do at least a POC).
> What do you think?

I don't think we should be worrying about REPACK here and now.

> Also, I think I'll extract reset-snapshot for CIC in a separate CF
> entry, since it still may be used with or without either STIR or LR.

Thanks!


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2026-03-09 00:09 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Rebased.


Attachments:

  [application/x-patch] v30-0006-Optimize-auxiliary-index-handling.patch (2.1K, 2-v30-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 056bb42e7c7d6f2cdd867a58ecba1ffb514d846e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v30 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c |  5 ++++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 91125d37150..ed563da5a32 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2914,6 +2914,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..ce76a213556 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && likely(!indexInfo->ii_Auxiliary) &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
-- 
2.43.0
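
To illustrate the two shortcuts in the patch above, here is a minimal, self-contained sketch in C. It is not PostgreSQL code: MockIndexInfo, form_index_datum, index_unchanged_hint, values_unchanged, compute_attr and the two counters are hypothetical stand-ins for IndexInfo, FormIndexDatum and the indexUnchanged computation in ExecInsertIndexTuples, and they assume only what the diff itself shows.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;

/* Hypothetical stand-in for the IndexInfo fields the patch touches. */
typedef struct MockIndexInfo
{
	int			num_attrs;		/* models ii_NumIndexAttrs */
	bool		auxiliary;		/* models ii_Auxiliary */
} MockIndexInfo;

/* Counters that make the skipped work observable. */
int			attr_evaluations = 0;
int			comparisons = 0;

/* Models the per-attribute work (column fetch or expression evaluation). */
Datum
compute_attr(const int *row, int attno)
{
	attr_evaluations++;
	return (Datum) row[attno];
}

/* Models the value comparison done for the indexUnchanged hint. */
bool
values_unchanged(void)
{
	comparisons++;
	return true;
}

/*
 * Models the early return added to FormIndexDatum: an auxiliary index
 * stores only TIDs, so every attribute is reported as NULL and no value
 * is computed.
 */
void
form_index_datum(const MockIndexInfo *info, const int *row,
				 Datum *values, bool *isnull)
{
	int			i;

	if (info->auxiliary)
	{
		for (i = 0; i < info->num_attrs; i++)
		{
			values[i] = (Datum) 0;
			isnull[i] = true;
		}
		return;
	}

	for (i = 0; i < info->num_attrs; i++)
	{
		values[i] = compute_attr(row, i);
		isnull[i] = false;
	}
}

/*
 * Models the indexUnchanged expression: && short-circuits, so the
 * comparison is never called for an auxiliary index.
 */
bool
index_unchanged_hint(const MockIndexInfo *info, bool is_update)
{
	return is_update && !info->auxiliary && values_unchanged();
}
```

In this model, calling form_index_datum and index_unchanged_hint with auxiliary set to true leaves both counters at zero, which is the saving the commit message describes. With auxiliary set to false, each attribute is evaluated once and the comparison runs once per update.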



  [application/x-patch] v30-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.3K, 3-v30-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From ff9edd2c5143d299879a1fb2aff3dab0c8b04420 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v30 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It will be used by the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 559 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 113 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 760 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..f1785b9a456 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3138,6 +3138,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3189,6 +3190,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..e550d8892e6
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,559 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "miscadmin.h"
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validate may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark to skip inserts and warn about an index drop needed */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't need to care about concurrently added
+	 * pages, because by this point the index is marked as not ready and is no
+	 * longer used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 43de42ce39e..1325f3d9700 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3409,6 +3409,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index c78dcea98c1..87e01e74ad7 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -307,6 +307,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 53adac9139b..f81dd30df24 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -725,6 +725,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..dfdccfaf991 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -885,6 +885,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 2caec621d73..09ae445694d 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 4c0429cc613..cd467582731 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -56,6 +56,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 0bd17b30ca7..e2966165e6f 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -52,8 +52,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..18ee36506fd
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "access/xlog.h"
+#include "access/generic_xlog.h"
+#include "access/itup.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 361e2cfffeb..f1475def487 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c067d5aae..6ce9154b28d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -157,8 +157,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -218,7 +218,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 6ff4d7ee901..9259679eea2 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2129,9 +2129,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/x-patch] v30-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.9K, 4-v30-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From caadba1e61ab6881831a64ce3d891829ddcb8208 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v30 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
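Note for reviewers: the two-pointer step of the merge described above can be sketched in isolation. This is only an illustration, not code from the patch. It assumes the TIDs from both indexes have already been encoded as 64-bit integers and sorted ascending (plain arrays stand in for the tuplesorts and the tuplestore), and the helper name tids_missing_from_main is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): collect every TID that is
 * present in the sorted array aux[] but absent from the sorted array
 * main_tids[].  Both inputs must be sorted ascending.  out[] must have
 * room for naux entries.  Returns the number of TIDs written to out[].
 */
static size_t
tids_missing_from_main(const int64_t *aux, size_t naux,
					   const int64_t *main_tids, size_t nmain,
					   int64_t *out)
{
	size_t		i = 0;
	size_t		j = 0;
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (i = 0; i < naux; i++)
	{
		/* Skip main-index TIDs that sort before the current auxiliary TID. */
		while (j < nmain && main_tids[j] < aux[i])
			j++;

		/* Found in the main index: nothing to do for this TID. */
		if (j < nmain && main_tids[j] == aux[i])
			continue;

		/* Missing from main; emit it once even if aux[] holds duplicates. */
		if (nout == 0 || out[nout - 1] != aux[i])
			out[nout++] = aux[i];
	}

	return nout;
}
```

In this sketch the output array plays the role of the tuplestore: each TID it holds would then be fetched from the heap and, if the tuple is still alive, inserted into the main index. Because both inputs are sorted, the scan is linear in the combined number of TIDs.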
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  42 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 548 ++++++++++++++-------
 src/backend/catalog/index.c                | 308 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1121 insertions(+), 335 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b3d53550688..ce08e0d8b10 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6643,6 +6643,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6683,13 +6695,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6706,8 +6717,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..1c3c7a97f6a 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -76,7 +76,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
       this index is left as <quote>invalid</quote>. Such indexes are useless
       but it can be convenient to use <command>REINDEX</command> to rebuild
       them. Note that only <command>REINDEX INDEX</command> is able
-      to perform a concurrent build on an invalid index.
+      to perform a concurrent build on an invalid index.
      </para>
     </listitem>
 
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3ff36f59bf8..4ad8a2c0f81 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1759,242 +1760,409 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set-theory notation: store = aux - main, or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int
+heapam_index_validate_tuplesort_difference(Tuplesortstate  *main,
+										   Tuplesortstate  *aux,
+										   Tuplestorestate *store)
+{
+	int				num = 0;
+	/* state variables for the merge */
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (the "aux" argument),
+	 * which holds TIDs that must be compared to those from the "main" sort
+	 * (the "main" argument).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		* Attempt to fetch the next TID from the auxiliary sort. If it's
+		* empty, we set auxindexcursor to NULL.
+		*/
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		* If the auxiliary sort is not yet empty, we now try to synchronize
+		* the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		* the main sort cursor until we've reached or passed the auxiliary TID.
+		*/
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. Remember it in the store; it will be inserted
+			 * into the target index later if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int				num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of maintenance_work_mem for the tuplestore. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can. */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1325f3d9700..4f77627fb3b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -712,11 +712,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -757,6 +762,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -782,7 +788,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -790,6 +799,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1395,7 +1409,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1470,6 +1485,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Look up the pg_index tuple of the main index in the syscache */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes do not use WITHOUT OVERLAPS */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2450,7 +2613,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2510,7 +2674,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3286,12 +3451,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  It is a fast operation without
+ * any actual table scan.  As a result, we have an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3301,14 +3475,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot.  Any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3316,12 +3493,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls must
+ * run in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3339,22 +3518,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for a tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3387,6 +3570,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3411,15 +3595,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3442,27 +3660,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3471,6 +3692,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3531,6 +3753,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3802,6 +4030,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4044,6 +4279,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4069,6 +4305,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ecb7c996e86..3e40ed2f439 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1350,16 +1350,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 635679cc1f2..f583239e091 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -182,6 +182,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -232,6 +233,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -243,7 +245,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -556,6 +559,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -565,6 +569,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -586,6 +591,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -836,6 +842,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -930,7 +945,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1600,6 +1616,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index's catalog entry.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1628,11 +1654,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1642,7 +1668,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1681,7 +1707,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1693,14 +1719,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but is marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index marked as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1735,9 +1791,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1755,24 +1830,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the auxiliary and target indexes, inserting any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1799,7 +1864,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1824,6 +1889,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Wait until all transactions see the auxiliary index as not ready */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the dead auxiliary index */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3595,6 +3707,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3700,8 +3813,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3753,8 +3873,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3815,6 +3942,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3918,15 +4052,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3977,6 +4114,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3990,12 +4132,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4004,6 +4151,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4022,10 +4170,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4106,13 +4258,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index: fast, since it is created empty with no heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4159,6 +4358,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4166,12 +4400,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4209,7 +4437,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4238,7 +4466,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4329,14 +4557,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4361,6 +4589,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4374,11 +4624,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4398,6 +4648,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 09ae445694d..8cb2231a7a8 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..1a997537800 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -705,7 +705,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1824,19 +1825,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index b259c4141ed..37a390d33de 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -100,6 +102,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -145,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 359221dc296..841491a8511 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -111,14 +111,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 982ec25ae14..6cf45a68cbe 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index dc629928c8f..9b06ddc87a2 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index deb6e2ad6a9..11cb06cfb72 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2068,14 +2068,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
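
For readers following the renumbering in the progress.h and rules.out hunks above, here is a minimal sketch of how the new phase numbers line up with the labels shown by pg_stat_progress_create_index. The numbers and label strings are copied from those hunks; the array name and the lookup helper create_idx_phase_label() are hypothetical and are not part of the patch.

```c
#include <stddef.h>

/*
 * Labels indexed by phase number, as listed in the patched rules.out.
 * Phase 3 ("building index") is shown by the view with an optional
 * access-method subphase appended; only the fixed prefix is kept here.
 * The array and helper names are illustrative assumptions.
 */
static const char *const create_idx_phase_labels[] = {
	"initializing",									/* 0 */
	"waiting for writers before build",				/* 1: WAIT_1 */
	"waiting for writers to use auxiliary index",	/* 2: WAIT_2 */
	"building index",								/* 3: BUILD */
	"waiting for writers before validation",		/* 4: WAIT_3 */
	"index validation: scanning index",				/* 5: VALIDATE_IDXSCAN */
	"index validation: sorting tuples",				/* 6: VALIDATE_SORT */
	"index validation: merging indexes",			/* 7: VALIDATE_IDXMERGE */
	"waiting for old snapshots",					/* 8: WAIT_4 */
	"waiting for readers before marking dead",		/* 9: WAIT_5 */
	"waiting for readers before dropping"			/* 10: WAIT_6 */
};

/* Return the label for a phase number, or NULL if it is out of range. */
const char *
create_idx_phase_label(int phase)
{
	int			nphases = (int) (sizeof(create_idx_phase_labels) /
								 sizeof(create_idx_phase_labels[0]));

	if (phase < 0 || phase >= nphases)
		return NULL;
	return create_idx_phase_labels[phase];
}
```

Note that the old "index validation: scanning table" phase has no counterpart in the new numbering; its slot is taken by "index validation: merging indexes", and every wait phase after the first shifts by one.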



  [application/x-patch] v30-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.9K, 5-v30-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 8ff96b8f44ceccf429e3e635a806311b14fc26d7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v30 5/7] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  58 ++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 371 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..7f751453317 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 1c3c7a97f6a..384c5fc8b3f 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 09575278de3..74f8335888b 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -287,7 +287,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4f77627fb3b..91125d37150 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -773,6 +773,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1178,6 +1180,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1410,7 +1421,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1578,7 +1590,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2614,7 +2627,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2675,7 +2689,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3840,6 +3855,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3896,6 +3912,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* If this index has a leftover auxiliary index, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4184,7 +4213,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4273,13 +4303,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4305,18 +4352,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..7e0e29bdb5b 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that an AUTO-dependent object with relkind RELKIND_INDEX
+		 * and the STIR access method is the auxiliary index we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 87e01e74ad7..c511563f3ff 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -308,6 +308,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f583239e091..599e3375833 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -246,7 +246,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -946,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3708,6 +3709,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4057,6 +4059,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4064,6 +4067,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4137,12 +4141,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4152,6 +4161,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4173,10 +4183,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4365,7 +4383,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4388,6 +4407,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4607,6 +4629,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4658,6 +4682,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 85242dcc245..1622dfb05ca 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1548,6 +1548,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1608,9 +1610,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1662,6 +1675,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop requires it to be the first transaction. But if
+		 * we have a junk auxiliary index, we want to drop it too (also
+		 * concurrently). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to
+		 * do so in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1690,12 +1735,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 8cb2231a7a8..ccc1294e730 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6ce9154b28d..e8236eede00 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -220,6 +220,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 6cf45a68cbe..92dff90c3de 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/x-patch] v30-0007-Refresh-snapshot-periodically-during-index-valid.patch (21.5K, 6-v30-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 1f0083bc9b7a1b944dff98f425ef754262adaa25 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v30 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing snapshots rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 pages read (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 65 +++++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c    | 12 +++--
 src/backend/catalog/index.c              | 51 ++++++++++++++-----
 src/backend/commands/indexcmds.c         | 50 +++---------------
 src/include/access/tableam.h             | 25 ++++-----
 src/include/access/transam.h             | 15 ++++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 146 insertions(+), 78 deletions(-)
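
For readers skimming the diff, the refresh cadence in heapam_index_validate_scan() can be sketched in isolation. This is only an illustration under stated assumptions, not the patch's code: FakeSnapshotSource, take_snapshot_xmin(), validate_scan_sketch() and xid_newer() are hypothetical stand-ins (xid_newer() plays the role the patch gives to TransactionIdNewer, whose definition is not shown in this excerpt), and "taking a snapshot" is reduced to reading an xmin value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Same cadence as VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE in the patch. */
#define RESET_SNAPSHOT_EACH_N_PAGE 4096

/* Hypothetical stand-in for GetLatestSnapshot(): each call yields a larger xmin. */
typedef struct FakeSnapshotSource
{
	TransactionId next_xmin;
	TransactionId step;
} FakeSnapshotSource;

typedef struct ScanResult
{
	TransactionId limit_xmin;	/* newest xmin seen over all snapshots */
	unsigned	refreshes;		/* snapshot replacements during the scan */
} ScanResult;

/* Wraparound-aware comparison of two normal xids: is a logically before b? */
static int
xid_precedes(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) < 0;
}

/* Assumed semantics: return whichever xid is logically newer. */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	return xid_precedes(a, b) ? b : a;
}

static TransactionId
take_snapshot_xmin(FakeSnapshotSource *src)
{
	TransactionId xmin = src->next_xmin;

	src->next_xmin += src->step;
	return xmin;
}

static ScanResult
validate_scan_sketch(unsigned npages, FakeSnapshotSource *src)
{
	ScanResult	res;
	unsigned	page_read_counter = 1;	/* 1, so no reset right at the start */
	unsigned	page;

	assert(src != NULL);

	/* First snapshot, taken before any page is read. */
	res.limit_xmin = take_snapshot_xmin(src);
	res.refreshes = 0;

	for (page = 0; page < npages; page++)
	{
		page_read_counter++;

		/* ... tuples on this page would be checked against the snapshot ... */

		if (page_read_counter % RESET_SNAPSHOT_EACH_N_PAGE == 0)
		{
			/* Drop the old snapshot, take a new one, keep the newest xmin. */
			res.limit_xmin = xid_newer(res.limit_xmin,
									   take_snapshot_xmin(src));
			res.refreshes++;
		}
	}

	return res;
}
```

Because the counter starts at 1, the first replacement in this sketch happens after the 4095th page rather than the 4096th. The value returned as limit_xmin corresponds to what the caller later uses to wait out transactions with older snapshots.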

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 4ad8a2c0f81..0becbcc5b87 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2012,23 +2012,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int				num_to_check;
+	BlockNumber		page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2039,6 +2042,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -2047,6 +2052,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check =  tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2062,6 +2073,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so
+	 * often to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2095,6 +2129,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2152,6 +2187,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2161,11 +2210,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 6b7117b56b2..7ea60c18e6f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ed563da5a32..16958bc8e91 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -68,6 +68,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3511,8 +3512,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3525,7 +3527,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3546,13 +3548,14 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3602,8 +3605,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3639,6 +3646,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3658,7 +3668,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3667,6 +3683,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3686,19 +3705,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3721,6 +3745,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 599e3375833..7f8adbdbda3 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -595,7 +595,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1813,32 +1812,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1860,8 +1838,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4426,7 +4404,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4441,13 +4418,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4459,16 +4429,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4481,7 +4443,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a997537800..2380a593d71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -701,12 +701,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1829,20 +1828,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6fa91bfcdc0..b33084cb91a 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 37a390d33de..d1f6411dd78 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -152,7 +152,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0
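
A note on the TransactionIdNewer() helper added to transam.h above: it treats an invalid XID as "no value" and otherwise compares the two XIDs modulo 2^32, so the result stays correct across wraparound. The standalone sketch below mirrors that logic outside the backend. The names Xid, xid_follows, xid_newer and xid_newest are hypothetical stand-ins, not PostgreSQL identifiers, and the folding loop is only an assumption about how a caller might accumulate limitXmin over several snapshots; the scan code that does this is not part of this patch excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;				/* stand-in for TransactionId */

#define INVALID_XID			((Xid) 0)	/* mirrors InvalidTransactionId */
#define FIRST_NORMAL_XID	((Xid) 3)	/* mirrors FirstNormalTransactionId */

static inline bool
xid_is_valid(Xid x)
{
	return x != INVALID_XID;
}

static inline bool
xid_is_normal(Xid x)
{
	return x >= FIRST_NORMAL_XID;
}

/* True if a is logically later than b, using modulo-2^32 arithmetic. */
static inline bool
xid_follows(Xid a, Xid b)
{
	/* Special (non-normal) XIDs are compared as plain integers. */
	if (!xid_is_normal(a) || !xid_is_normal(b))
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Same shape as TransactionIdNewer() in the patch. */
static inline Xid
xid_newer(Xid a, Xid b)
{
	if (!xid_is_valid(a))
		return b;
	if (!xid_is_valid(b))
		return a;
	return xid_follows(a, b) ? a : b;
}

/* Hypothetical: fold the xmin of each snapshot taken into one limit. */
static inline Xid
xid_newest(const Xid *xmins, size_t n)
{
	Xid			limit = INVALID_XID;

	assert(xmins != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		limit = xid_newer(limit, xmins[i]);
	return limit;
}
```

With this sketch, xid_newer(4294967290, 10) picks 10, because 10 is sixteen XIDs "after" 4294967290 once the counter wraps; a plain integer max() would pick the wrong one.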



  [application/x-patch] v30-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (19.7K, 7-v30-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
  download | inline diff:
From c2296f78588a531978e5aa8e96cd6c4eae9a5f65 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v30 3/7] Add Datum storage support to tuplestore Extend
 tuplestore to store individual Datum values: - fixed-length datatypes and
 variable-length datatypes: include a length header - by-value types: store
 inline with one extra byte (but without support of random access)

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 330 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 293 insertions(+), 70 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index afba82f28a2..692e325eafd 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 
@@ -115,16 +120,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -149,6 +153,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int(*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -185,6 +195,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -193,9 +204,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -206,10 +217,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For by-value Datums, a single marker byte replaces both length words.
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (or the marker byte, in the case of a by-value Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -241,11 +255,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -268,6 +287,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -345,6 +370,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -443,16 +499,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -776,6 +835,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot(), but stores a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1027,10 +1105,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1042,6 +1120,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1059,7 +1138,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1090,7 +1169,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1152,6 +1231,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For by-value datum we may receive zero as valid value. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1460,8 +1574,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1556,25 +1673,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1585,6 +1683,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1631,3 +1742,112 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - By-reference types (fixed- or variable-length) get a length prefix (and suffix if backward scan)
+ * - By-value types handled inline without extra copying, storing single extra byte
+ *   XXX: consider refactoring to avoid it, currently need it for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8_t	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8_t), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8_t) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8_t junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* marker byte, used only to detect the end of data for rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+
+		BufFileWrite(state->myfile, &size, sizeof(size));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &size, sizeof(size));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		Datum *data = palloc(len);
+		BufFileReadExact(state->myfile, data, len);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 1c08e219e89..665d6d57635 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
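
To make the by-value on-tape layout from writetup_datum()/lentup_datum() above easier to follow: each item is one marker byte followed by the raw value, and end of data is detected when the marker byte cannot be read. That is why a stored value of zero is still a valid result and only EOF makes the getter return false. The sketch below reproduces this with a small in-memory buffer. Tape, tape_write, tape_read, put_int32 and get_int32 are hypothetical names for illustration only; the real code goes through BufFile and handles any by-value typlen, not just 4 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the BufFile "tape". */
typedef struct Tape
{
	uint8_t		buf[256];
	size_t		len;			/* bytes written so far */
	size_t		pos;			/* current read position */
} Tape;

static inline void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->len + n <= sizeof(t->buf));
	memcpy(t->buf + t->len, src, n);
	t->len += n;
}

/* Returns n, or 0 on a clean end of data. */
static inline size_t
tape_read(Tape *t, void *dst, size_t n)
{
	if (t->pos == t->len)
		return 0;
	assert(t->pos + n <= t->len);
	memcpy(dst, t->buf + t->pos, n);
	t->pos += n;
	return n;
}

/* Write one 4-byte by-value item: marker byte, then the raw value. */
static inline void
put_int32(Tape *t, int32_t value)
{
	uint8_t		marker = (uint8_t) sizeof(value);

	tape_write(t, &marker, sizeof(marker));
	tape_write(t, &value, sizeof(value));
}

/* Read one item; returns false only at end of data, never for value 0. */
static inline bool
get_int32(Tape *t, int32_t *result)
{
	uint8_t		marker;

	if (tape_read(t, &marker, sizeof(marker)) == 0)
		return false;
	assert(marker == sizeof(*result));
	tape_read(t, result, sizeof(*result));
	return true;
}
```

Each item costs five bytes on the tape here (one marker plus four value bytes). Writing 0, 42 and -7 and then calling get_int32() four times yields the three values followed by a single false.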



  [application/x-patch] v30-0001-Add-stress-tests-for-concurrent-index-builds.patch (9.3K, 8-v30-0001-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 6f41e75a4d85a8b46d88e3ad87ef184486f8f9a8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v30 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 225 ++++++++++++++++++++++++++++++++
 2 files changed, 226 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..f160f9d18d7
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,225 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+# uncomment to force non-HOT -> $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=15 --jobs=4 --exit-on-abort --transactions=1000',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\sleep 10 ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\sleep 10 ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-03-23 22:08                                                                   ` Mihail Nikalayeu <[email protected]>
  2026-03-28 19:17                                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-03-23 22:08 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Fixed compilation, updated the stress test, fixed a few potential issues
with tuplestore, and made some style fixes.

Best regards,
Mikhail.


Attachments:

  [application/octet-stream] v31-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 2-v31-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
From bd1919d3a299ac927322d2c3d5eee1b273ba43a5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v31 3/7] Add Datum storage support to tuplestore

Extend tuplestore to store individual Datum values:
- fixed-length and variable-length datatypes: include a length header
- by-value types: stored inline with one extra byte (but without support
  for random access)

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
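
Reviewer note (not part of the commit): the two on-tape layouts described
in the commit message can be modelled outside the backend. The following is
a minimal standalone sketch, not the patch's code. It assumes a hypothetical
in-memory Tape in place of BufFile, and the names tape_write, tape_read,
put_byval, get_byval, put_byref and get_byref are illustrative only. It also
leaves out the trailing length word that the patch writes when backward
scans are enabled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; the patch itself writes to a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		len;			/* bytes written so far */
	size_t		pos;			/* current read position */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->len + n <= sizeof(t->buf));
	memcpy(t->buf + t->len, src, n);
	t->len += n;
}

/* Returns n on success, or 0 if fewer than n bytes remain. */
static size_t
tape_read(Tape *t, void *dst, size_t n)
{
	if (t->pos + n > t->len)
		return 0;
	memcpy(dst, t->buf + t->pos, n);
	t->pos += n;
	return n;
}

/* By-value layout: one marker byte, then the value itself. */
static void
put_byval(Tape *t, int32_t v)
{
	uint8_t		marker = (uint8_t) sizeof(v);

	tape_write(t, &marker, sizeof(marker));
	tape_write(t, &v, sizeof(v));
}

/*
 * Returns 1 and sets *v, or 0 at end of data.  End of data is detected
 * from the missing marker byte, so a stored zero is still a valid value.
 */
static int
get_byval(Tape *t, int32_t *v)
{
	uint8_t		marker;

	if (tape_read(t, &marker, sizeof(marker)) == 0)
		return 0;
	assert(marker == sizeof(*v));
	return tape_read(t, v, sizeof(*v)) != 0;
}

/* By-reference layout: a length word that counts itself, then the payload. */
static void
put_byref(Tape *t, const void *data, unsigned int size)
{
	unsigned int tuplen = size + sizeof(unsigned int);

	tape_write(t, &tuplen, sizeof(tuplen));
	tape_write(t, data, size);
}

/* Returns the payload size, or 0 at end of data. */
static unsigned int
get_byref(Tape *t, void *dst, unsigned int dstsize)
{
	unsigned int tuplen;
	unsigned int size;

	if (tape_read(t, &tuplen, sizeof(tuplen)) == 0)
		return 0;
	size = tuplen - sizeof(unsigned int);
	assert(size <= dstsize);
	if (tape_read(t, dst, size) != size)
		return 0;
	return size;
}
```

In this model, a reader that stores a zero with put_byval gets it back from
get_byval with a return value of 1, and only sees 0 once the tape is
exhausted. That is the distinction the marker byte is there to preserve.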
 src/backend/utils/sort/tuplestore.c | 361 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 321 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 273a4c9b02f..3fc54deb0fd 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 #include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* is that type pass-by-value? */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int(*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -207,10 +218,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For by-value Datums, both length words are replaced by one marker byte.
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (or the marker byte, for a by-value Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For a by-value datum, zero can be a valid value. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_advance - exported function to adjust position without fetching
  *
@@ -1173,10 +1287,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
 			pfree(tuple);
 		return true;
 	}
-	else
+
+	/*
+	 * A NULL return normally means end-of-data, but for by-value datum
+	 * stores a valid zero-valued datum (e.g., false, 0) is indistinguishable
+	 * from NULL via pointer check.  Use eof_reached to distinguish.
+	 */
+	if (state->datumTypeByVal)
 	{
-		return false;
+		TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+		return !readptr->eof_reached;
 	}
+
+	return false;
 }
 
 /*
@@ -1239,7 +1363,12 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
 				tuple = tuplestore_gettuple(state, forward, &should_free);
 
 				if (tuple == NULL)
-					return false;
+				{
+					/* See tuplestore_advance for why pointer check is insufficient */
+					if (!state->datumTypeByVal ||
+						state->readptrs[state->activeptr].eof_reached)
+						return false;
+				}
 				if (should_free)
 					pfree(tuple);
 				CHECK_FOR_INTERRUPTS();
@@ -1461,8 +1590,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1557,25 +1689,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1586,6 +1699,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1632,3 +1758,122 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed-length and variable-length Datums:
+ * - Fixed-length and variable-length types include a length prefix (and a suffix for backward scans)
+ * - By-value types are handled inline without extra copying, storing a single extra byte
+ *   XXX: consider refactoring to avoid it; currently it is needed for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8_t	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8_t), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8_t) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8_t junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* just a marker byte, used to track the end of data for rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size = state->datumTypeLen;
+		unsigned int tuplen;
+
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+
+		/*
+		 * Include sizeof(unsigned int) in the stored length, matching the
+		 * convention used by writetup_heap.  The backward-scan seek
+		 * arithmetic in tuplestore_gettuple assumes this.
+		 */
+		tuplen = size + sizeof(unsigned int);
+		BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		unsigned int datalen = len - sizeof(unsigned int);
+		Datum *data = palloc(datalen);
+
+		BufFileReadExact(state->myfile, data, datalen);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index 1c08e219e89..665d6d57635 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc.  It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting but can only store and regurgitate a sequence of
+ * tuples or Datums.  However, because no sort is required, it is allowed
+ * to start reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_advance(Tuplestorestate *state, bool forward);
 
-- 
2.43.0
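
For readers following the by-reference branch of writetup_datum/readtup_datum in the patch above, here is a rough, standalone sketch of the record framing: a leading length word that counts itself plus the payload, the payload bytes, and an optional trailing copy of the length word when backward scans are enabled. The MemFile type and the mem_write, mem_read, write_record and read_record names are hypothetical stand-ins for BufFile and the tuplestore internals, and unlike the patch (where the caller reads the leading length word before calling readtup_datum) the sketch reads it inside read_record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the BufFile used by tuplestore. */
typedef struct MemFile
{
	unsigned char data[256];
	size_t		wpos;			/* next write offset */
	size_t		rpos;			/* next read offset */
} MemFile;

static void
mem_write(MemFile *f, const void *src, size_t n)
{
	assert(f->wpos + n <= sizeof(f->data));
	memcpy(f->data + f->wpos, src, n);
	f->wpos += n;
}

static void
mem_read(MemFile *f, void *dst, size_t n)
{
	assert(f->rpos + n <= f->wpos);
	memcpy(dst, f->data + f->rpos, n);
	f->rpos += n;
}

/* Write one record: leading length word, payload, optional trailing word. */
static void
write_record(MemFile *f, const void *payload, unsigned int size, int backward)
{
	/* The stored length counts the length word itself plus the payload. */
	unsigned int tuplen = size + (unsigned int) sizeof(unsigned int);

	mem_write(f, &tuplen, sizeof(tuplen));
	mem_write(f, payload, size);
	if (backward)
		mem_write(f, &tuplen, sizeof(tuplen));
}

/* Read one record into out and return the payload size in bytes. */
static unsigned int
read_record(MemFile *f, void *out, int backward)
{
	unsigned int len;
	unsigned int datalen;

	mem_read(f, &len, sizeof(len));
	datalen = len - (unsigned int) sizeof(unsigned int);
	mem_read(f, out, datalen);

	/* Skip the trailing length word that backward scans rely on. */
	if (backward)
		mem_read(f, &len, sizeof(len));
	return datalen;
}
```

The trailing length word costs a few extra bytes per record, but it lets a reader positioned at the end of a record find where that record started without an index.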



  [application/octet-stream] v31-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.4K, 3-v31-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
From 2b1a423175ca6893b56bda69e1827e595f22df5e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v31 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds: it temporarily stores TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

These pieces will be used in the following commits.
---
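As a reading aid for the description above, the following is a minimal standalone model of how stirinsert in this patch appends TIDs: try the last data page, and if it is full, extend by one page and record the new last block. All Sketch* types, the sketch_* functions and the tiny page capacity are hypothetical and exist only for illustration; the sketch ignores buffer locking, critical sections and the retry loop for concurrent extenders, and it returns whether the TID was stored, whereas the real stirinsert always returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_TIDS_PER_PAGE 4	/* hypothetical, tiny on purpose */
#define SKETCH_MAX_PAGES 8		/* hypothetical fixed-size "file" */

typedef struct SketchTid
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

typedef struct SketchPage
{
	uint16_t	maxoff;			/* number of TIDs stored on the page */
	SketchTid	tids[SKETCH_TIDS_PER_PAGE];
} SketchPage;

typedef struct SketchStir
{
	bool		skip_inserts;	/* plays the role of skipInserts */
	uint32_t	last_blk;		/* plays the role of lastBlkNo; 0 = none */
	uint32_t	npages;			/* block 0 is reserved for the metapage */
	SketchPage	pages[SKETCH_MAX_PAGES];
} SketchStir;

static void
sketch_init(SketchStir *s)
{
	memset(s, 0, sizeof(*s));
	s->npages = 1;				/* only the metapage exists */
}

/* Returns true if the TID was stored (the real aminsert result differs). */
static bool
sketch_insert(SketchStir *s, SketchTid tid)
{
	SketchPage *page;

	if (s->skip_inserts)
		return false;

	/* First try the last data page, if there is one. */
	if (s->last_blk > 0)
	{
		page = &s->pages[s->last_blk];
		if (page->maxoff < SKETCH_TIDS_PER_PAGE)
		{
			page->tids[page->maxoff++] = tid;
			return true;
		}
	}

	/* Otherwise extend by one page and record it as the new last block. */
	assert(s->npages < SKETCH_MAX_PAGES);
	s->last_blk = s->npages++;
	page = &s->pages[s->last_blk];
	page->tids[page->maxoff++] = tid;
	return true;
}

/* Copy stored TIDs, in insertion order, into out; returns how many. */
static size_t
sketch_collect(const SketchStir *s, SketchTid *out, size_t max)
{
	size_t		n = 0;

	for (uint32_t blk = 1; blk < s->npages; blk++)
		for (uint16_t i = 0; i < s->pages[blk].maxoff && n < max; i++)
			out[n++] = s->pages[blk].tids[i];
	return n;
}
```

Because entries are only ever appended and never deleted, a full pass over the data pages (as stirbulkdelete does during validation) sees every stored TID exactly once.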
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 565 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 113 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 766 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h
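
A rough worked example of how many TIDs fit on one STIR data page, following the StirPageGetFreeSpace macro added in access/stir.h by this patch. The constants are assumptions for a default build (8 kB blocks, 8-byte maximum alignment, a 24-byte page header, 6-byte ItemPointerData) and are not stated in the patch; the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a default build; not taken from the patch itself. */
#define SKETCH_BLCKSZ			8192	/* default block size */
#define SKETCH_MAXIMUM_ALIGNOF	8		/* typical 64-bit platforms */
#define SKETCH_PAGE_HEADER_SIZE	24		/* assumed SizeOfPageHeaderData */
#define SKETCH_OPAQUE_SIZE		6		/* three 2-byte fields in the opaque area */
#define SKETCH_TUPLE_SIZE		6		/* assumed sizeof(ItemPointerData) */

/* Round len up to the next multiple of the alignment, like MAXALIGN. */
static size_t
sketch_maxalign(size_t len)
{
	return (len + SKETCH_MAXIMUM_ALIGNOF - 1)
		& ~((size_t) SKETCH_MAXIMUM_ALIGNOF - 1);
}

/* Free bytes on a data page that already holds maxoff tuples. */
static size_t
sketch_free_space(size_t maxoff)
{
	return SKETCH_BLCKSZ
		- sketch_maxalign(SKETCH_PAGE_HEADER_SIZE)
		- maxoff * SKETCH_TUPLE_SIZE
		- sketch_maxalign(SKETCH_OPAQUE_SIZE);
}

/* Count tuples that fit before the "does it fit?" check would fail. */
static size_t
sketch_tuples_per_page(void)
{
	size_t		n = 0;

	while (sketch_free_space(n) >= SKETCH_TUPLE_SIZE)
		n++;
	return n;
}
```

Under these assumptions an empty data page has 8192 - 24 - 8 = 8160 usable bytes, so it holds 8160 / 6 = 1360 TIDs.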

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..f1785b9a456 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3138,6 +3138,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3189,6 +3190,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..f21b229de42
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,565 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. it is created as an auxiliary index for CIC/RIC
+ * 2. it accepts inserts for a period
+ * 3. stirbulkdelete is called during the index validation phase
+ * 4. it gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that the strategy number is allowed for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+			START_CRIT_SECTION();
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+				END_CRIT_SECTION();
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			END_CRIT_SECTION();
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+		/* Re-check after acquiring exclusive lock */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+			START_CRIT_SECTION();
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			END_CRIT_SECTION();
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+		  struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For a normal VACUUM, mark the index to skip inserts and warn that it needs to be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages.  We don't care about concurrently added pages,
+	 * because by this point the index is marked as not ready and no longer
+	 * receives inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	START_CRIT_SECTION();
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	END_CRIT_SECTION();
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b3c60d91f9..f5484c59d18 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3412,6 +3412,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 078a1cf5127..c33e43df1ec 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -313,6 +313,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..1fbe70d187c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -726,6 +726,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..dfdccfaf991 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -885,6 +885,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..0356901ee10 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index 0bd17b30ca7..e2966165e6f 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -52,8 +52,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..18ee36506fd
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,113 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "access/xlog.h"
+#include "access/generic_xlog.h"
+#include "access/itup.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fc8d82665b8..bac9a148700 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..0f834889912 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -227,7 +227,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 6ff4d7ee901..9259679eea2 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2129,9 +2129,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v31-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (94.9K, 4-v31-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
From a1b4fb5ced0e25ab86dfbb628b94ce0b69c23019 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v31 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary-index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index
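
As a rough illustration of the two-pointer step above, here is a minimal standalone sketch. It works on plain sorted arrays rather than tuplesorts, and encode_tid() and sorted_difference() are hypothetical names made up for this example; they are not functions from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for an encoded TID: block number in the high bits,
 * offset number in the low 16 bits. This is an assumption for the sketch,
 * not the patch's itemptr_encode().
 */
static int64_t
encode_tid(uint32_t block, uint16_t offset)
{
	return ((int64_t) block << 16) | offset;
}

/*
 * Copy every value of the sorted array aux[] that is absent from the sorted
 * array main_tids[] into out[] (capacity must be at least naux).
 * Returns the number of values written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux, size_t naux,
				  int64_t *out)
{
	size_t		m = 0;		/* cursor into main_tids[] */
	size_t		nout = 0;

	assert(out != NULL || naux == 0);

	for (size_t a = 0; a < naux; a++)
	{
		/* Advance the main cursor until it reaches or passes aux[a]. */
		while (m < nmain && main_tids[m] < aux[a])
			m++;

		/* Main is exhausted or already past aux[a]: aux[a] is missing. */
		if (m == nmain || main_tids[m] > aux[a])
			out[nout++] = aux[a];
	}

	return nout;
}
```

Each input is read once, front to back, so the cost is linear in the combined number of TIDs after sorting.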

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
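
The idea of handing the read stream one heap block at a time, together with the offsets to check on that block, might be sketched as follows. This is a simplified illustration only: TidCursor, next_block(), INVALID_BLOCK and INVALID_OFFSET are hypothetical names, and it walks an in-memory array, so it does not need the "leftover offset" bookkeeping the real callback uses when reading from a tuplestore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_BLOCK	UINT32_MAX	/* hypothetical "no more blocks" value */
#define INVALID_OFFSET	0			/* hypothetical offset-list terminator */

/* Cursor over sorted TIDs, each encoded as (block << 16) | offset. */
typedef struct TidCursor
{
	const int64_t *tids;
	size_t		ntids;
	size_t		pos;
} TidCursor;

/*
 * Return the next block number and fill offsets[] with every offset that
 * belongs to it, terminated by INVALID_OFFSET. The caller must size
 * offsets[] for the largest group plus the terminator. Returns
 * INVALID_BLOCK once the input is exhausted.
 */
static uint32_t
next_block(TidCursor *cur, uint16_t *offsets)
{
	uint32_t	block = INVALID_BLOCK;
	size_t		n = 0;

	while (cur->pos < cur->ntids)
	{
		uint32_t	b = (uint32_t) (cur->tids[cur->pos] >> 16);

		if (block == INVALID_BLOCK)
			block = b;			/* first TID of this call picks the block */
		else if (b != block)
			break;				/* block changed: leave it for the next call */

		offsets[n++] = (uint16_t) (cur->tids[cur->pos] & 0xFFFF);
		cur->pos++;
	}

	offsets[n] = INVALID_OFFSET;
	return block;
}
```

Because the TIDs are sorted, each heap block is visited once, which is what lets the stream prefetch pages in order.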
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  40 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 553 ++++++++++++++-------
 src/backend/catalog/index.c                | 308 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 17 files changed, 1123 insertions(+), 336 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 462019a972c..b8031a3cb39 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6677,6 +6677,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6717,13 +6729,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6740,8 +6751,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits here before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..9e0248261ae 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index tracks all tuples inserted
+       during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+index, and inserts any missing ones if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..f90310a1ab8 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,15 +41,17 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
-									 Relation OldHeap, Relation NewHeap,
-									 Datum *values, bool *isnull, RewriteState rwstate);
+                                     Relation OldHeap, Relation NewHeap,
+                                     Datum *values, bool *isnull, RewriteState rwstate);
 
 static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
@@ -1768,242 +1770,409 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation: store = aux - main, or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+										   Tuplesortstate *aux,
+										   Tuplestorestate *store)
+{
+	int64		num = 0;
+	/* state variables for the merge */
+	ItemPointer	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been recorded yet (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int64			num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber* tuples;
+	ReadStream *read_stream;
+
+	/* Use 10% of maintenance_work_mem for the tuplestore. */
+	int		store_work_mem_part = maintenance_work_mem / 10;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as possible */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void*) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
-		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
-		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
-		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f5484c59d18..31f92b97580 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark the index as an auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,7 +1412,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1473,6 +1488,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and needs
+ * to be built later on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2453,7 +2616,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2513,7 +2677,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3289,12 +3454,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation that
+ * does not scan the table, so it leaves us with an empty STIR index.  We
+ * commit the transaction and again wait for all transactions that could have
+ * been modifying the table to terminate.  From that moment on, all new tuples
+ * are also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3304,14 +3478,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but those tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3319,12 +3496,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3342,22 +3521,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3390,6 +3573,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3414,15 +3598,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the index relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If the aux index is empty, the merge can be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3445,27 +3663,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts were ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3474,6 +3695,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3534,6 +3756,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3805,6 +4033,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4047,6 +4282,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4072,6 +4308,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f1ed7b58f13..0dfa46a9b74 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1379,16 +1379,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cbd76066f74..dc4af0409df 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1601,6 +1617,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * In case of a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1629,11 +1655,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1643,7 +1669,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1682,7 +1708,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1694,14 +1720,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. The index will be created empty,
+	 * without any actual heap scan, but marked as "ready-for-inserts". Its goal
+	 * is to accumulate new tuples while the main index is actually built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in table are inserted into
+	 * the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1736,9 +1792,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1756,24 +1831,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1800,7 +1865,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1825,6 +1890,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as not ready */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the dead auxiliary index */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3596,6 +3708,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3701,8 +3814,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3754,8 +3874,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3816,6 +3943,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3919,15 +4053,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3978,6 +4115,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3991,12 +4133,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4005,6 +4152,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4023,10 +4171,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4107,13 +4259,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index: fast, no heap scan, just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4160,6 +4359,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4167,12 +4401,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4210,7 +4438,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4239,7 +4467,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4330,14 +4558,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4362,6 +4590,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4375,11 +4625,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4399,6 +4649,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..1a997537800 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -705,7 +705,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1824,19 +1825,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 36b70689254..727993d1a5a 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -71,6 +72,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -151,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9c40772706c..8e5f98c6fad 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -117,14 +117,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index dc629928c8f..9b06ddc87a2 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 32bea58db2c..b80d5c2ed65 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2058,14 +2058,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v31-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (30.9K, 5-v31-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 40c6d37815c25c63d3ec1e0b4e119e193795fa02 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v31 5/7] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  71 ++++++++++----
 src/backend/catalog/pg_depend.c            |  58 ++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 371 insertions(+), 42 deletions(-)
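
A reader's sketch of the mechanism described in the commit message above, not code from the patch: the auxiliary index records an AUTO dependency on its main index, so a lookup by the main index's OID finds it, and dropping the main index removes it too. Everything here is a hypothetical stand-in (the `Toy*` types, the in-memory array used in place of pg_depend, and the `toy_*` functions); only the idea of matching "AUTO dependency + STIR access method" is taken from the patch's get_auxiliary_index().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real definitions live in the PostgreSQL headers. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_AM_BTREE, TOY_AM_STIR } ToyAccessMethod;

/* One row of a simplified, in-memory dependency table (assumption). */
typedef struct ToyDepend
{
	Oid			objid;		/* the dependent object (e.g. the aux index) */
	Oid			refobjid;	/* the object it depends on (the main index) */
	ToyDepType	deptype;	/* kind of dependency */
	ToyAccessMethod objam;	/* access method of the dependent index */
} ToyDepend;

/*
 * Return the OID of the auxiliary index recorded for indexId, or
 * InvalidOid if none: the first row that references indexId with an
 * AUTO dependency and whose dependent index uses the STIR access method.
 */
Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objam == TOY_AM_STIR)
			return deps[i].objid;
	}
	return InvalidOid;
}

/*
 * Model "dropping indexId also drops its AUTO dependents": remove every
 * AUTO row that references indexId, compacting the array in place.
 * Returns the number of dependent objects removed.
 */
size_t
toy_drop_auto_dependents(ToyDepend *deps, size_t *ndeps, Oid indexId)
{
	size_t		kept = 0;
	size_t		dropped = 0;

	assert(deps != NULL && ndeps != NULL);

	for (size_t i = 0; i < *ndeps; i++)
	{
		if (deps[i].refobjid == indexId && deps[i].deptype == TOY_DEP_AUTO)
			dropped++;				/* dependent goes away with its referent */
		else
			deps[kept++] = deps[i]; /* unrelated row survives */
	}
	*ndeps = kept;
	return dropped;
}
```

With one AUTO/STIR row pointing at a main index, toy_get_auxiliary_index() returns that row's objid; after toy_drop_auto_dependents() for the same main index, the lookup returns InvalidOid. The real patch does this through pg_depend scans and performDeletion(), which this sketch does not attempt to reproduce.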

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..7f751453317 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 9e0248261ae..54f7b36efa2 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 31f92b97580..4b6a0f76c81 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1581,7 +1593,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2617,7 +2630,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2678,7 +2692,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3843,6 +3858,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3899,6 +3915,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for a leftover auxiliary index for this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4187,7 +4216,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4276,13 +4306,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4308,18 +4355,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..7e0e29bdb5b 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,63 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * We assume that an AUTO dependency from an index with relkind
+		 * RELKIND_INDEX and the STIR access method is the one we are looking for.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index c33e43df1ec..b16eac0357f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dc4af0409df..b430d4a5b34 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3709,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4058,6 +4060,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4065,6 +4068,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4138,12 +4142,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4153,6 +4162,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4174,10 +4184,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4366,7 +4384,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start process of dropping auxiliary indexes - including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4389,6 +4408,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4608,6 +4630,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4659,6 +4683,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 67e42e5df29..87aba245b85 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop requires it to be the first transaction. But if
+		 * there is a junk auxiliary index, we want to drop it too (and also
+		 * concurrently). In that case, perform a silent internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to do
+		 * this in the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0f834889912..f97fcb7872c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -229,6 +229,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v31-0001-Add-stress-tests-for-concurrent-index-builds.patch (11.9K, 6-v31-0001-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From b6a36e045192906f72cb805f33f4cccafd780f89 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v31 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 273 ++++++++++++++++++++++++++++++++
 2 files changed, 274 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..0495ac10263
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,273 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+  $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+  $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+	'--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+	$pgbench_clients,
+	$pgbench_jobs,
+	$pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+		sprintf(
+		'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+		defined($ENV{PG_TEST_EXTRA})
+		? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+		: '(undef)',
+		$is_stress,
+		$pgbench_clients,
+		$pgbench_jobs,
+		$pgbench_transactions,
+		$max_sleep_ms,
+		$no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => sprintf(q(
+			SET debug_parallel_query = off; -- needed because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->stop;
+done_testing();
-- 
2.43.0



  [application/octet-stream] v31-0006-Optimize-auxiliary-index-handling.patch (2.1K, 7-v31-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From bb56f91df5a44c7865e6f599738cdec476497021 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v31 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indices by:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- set indexUnchanged=false for auxiliary indices to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 11 +++++++++++
 src/backend/executor/execIndexing.c |  5 ++++-
 2 files changed, 15 insertions(+), 1 deletion(-)
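Reviewer note: a minimal standalone sketch of the FormIndexDatum() short-circuit added in this patch, for readers without a PostgreSQL tree at hand. Everything named Sketch*/sketch_* is hypothetical (including the uintptr_t stand-in for Datum); only the control flow is meant to mirror the hunk in src/backend/catalog/index.c: when the index is auxiliary, every column is reported as NULL and no key value is computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's Datum and IndexInfo. */
typedef uintptr_t SketchDatum;

typedef struct SketchIndexInfo
{
	int			num_index_attrs;	/* models ii_NumIndexAttrs */
	bool		auxiliary;			/* models ii_Auxiliary */
} SketchIndexInfo;

/*
 * Fill values[]/isnull[] for one heap tuple.  Returns true if the key
 * columns were actually computed, false if the auxiliary fast path was
 * taken.  (The real function returns void; the return value exists only
 * to make the sketch easy to check.)
 */
bool
sketch_form_index_datum(const SketchIndexInfo *info,
						const SketchDatum *heap_values,
						SketchDatum *values, bool *isnull)
{
	int			i;

	/* Auxiliary index: no values are needed, report all columns as NULL. */
	if (info->auxiliary)
	{
		for (i = 0; i < info->num_index_attrs; i++)
		{
			values[i] = (SketchDatum) 0;
			isnull[i] = true;
		}
		return false;
	}

	/* Regular index: copy the (already extracted) column values. */
	for (i = 0; i < info->num_index_attrs; i++)
	{
		values[i] = heap_values[i];
		isnull[i] = false;
	}
	return true;
}
```

The saving in the real code is presumably largest for expression indexes, where the skipped step would otherwise evaluate the index expressions for every inserted tuple.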

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4b6a0f76c81..2d7d25f1986 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2917,6 +2917,17 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+		{
+			values[i] = PointerGetDatum(NULL);
+			isnull[i] = true;
+		}
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..ce76a213556 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && likely(!indexInfo->ii_Auxiliary) &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
-- 
2.43.0



  [application/octet-stream] v31-0007-Refresh-snapshot-periodically-during-index-valid.patch (22.5K, 8-v31-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 76c15aa5624a9dd861862bd42956cebf042459bc Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v31 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every 4096 pages (VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE), allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT       |  4 +-
 src/backend/access/heap/heapam_handler.c | 65 +++++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c    | 12 +++--
 src/backend/catalog/index.c              | 63 ++++++++++++++++-------
 src/backend/commands/indexcmds.c         | 50 +++---------------
 src/include/access/tableam.h             | 25 ++++-----
 src/include/access/transam.h             | 15 ++++++
 src/include/catalog/index.h              |  2 +-
 8 files changed, 153 insertions(+), 83 deletions(-)
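Reviewer note: a minimal standalone model of the refresh loop in heapam_index_validate_scan(), assuming nothing beyond what the hunks in this patch show. All Sketch*/sketch_* names are hypothetical. In particular, sketch_xid_newer() is only an assumed reading of TransactionIdNewer() (keep whichever XID is later under wraparound-aware comparison); its definition in transam.h is not reproduced here. A "snapshot" is reduced to the xmin it would carry.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchXid;

/* Mirrors VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE from the patch. */
#define SKETCH_RESET_EACH_N_PAGES 4096

/* Wraparound-aware "a is later than b" test for normal XIDs. */
int
sketch_xid_follows(SketchXid a, SketchXid b)
{
	return (int32_t) (a - b) > 0;
}

/* Assumed behavior of TransactionIdNewer(): return the later XID. */
SketchXid
sketch_xid_newer(SketchXid a, SketchXid b)
{
	return sketch_xid_follows(b, a) ? b : a;
}

/* Hypothetical snapshot source: returns the xmin of a fresh snapshot. */
typedef SketchXid (*sketch_snapshot_fn) (void *arg);

/*
 * Walk npages pages, replacing the snapshot every N pages, and return the
 * newest xmin seen.  The caller would wait out every transaction whose
 * snapshot is older than the returned value.
 */
SketchXid
sketch_validate_scan(uint32_t npages, sketch_snapshot_fn take_snapshot,
					 void *arg, uint32_t *nrefreshes)
{
	uint32_t	page_read_counter = 1;	/* 1 skips a reset at the start */
	SketchXid	limit_xmin = take_snapshot(arg);
	uint32_t	page;

	*nrefreshes = 0;
	for (page = 0; page < npages; page++)
	{
		page_read_counter++;
		/* ... check this page's tuples against the current snapshot ... */
		if (page_read_counter % SKETCH_RESET_EACH_N_PAGES == 0)
		{
			/* drop the old snapshot, take a new one, remember its xmin */
			limit_xmin = sketch_xid_newer(limit_xmin, take_snapshot(arg));
			(*nrefreshes)++;
		}
	}
	return limit_xmin;
}

/* Example snapshot source: each call reports an xmin 10 later. */
SketchXid
sketch_advancing_xmin(void *arg)
{
	SketchXid  *next = (SketchXid *) arg;
	SketchXid	cur = *next;

	*next += 10;
	return cur;
}
```

With the counter starting at 1, a scan over 10000 pages in this model replaces the snapshot twice (after the 4095th and 8191st page). What the model leaves out is the point of the real code: between dropping one snapshot and taking the next, the backend holds no snapshot at all, so MyProc->xmin is cleared and the global horizon can move forward.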

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index f90310a1ab8..78bc1bff70e 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2022,23 +2022,26 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int64			num_to_check;
+	BlockNumber		page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
 	ValidateIndexScanState callback_private_data;
 
@@ -2049,6 +2052,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use 10% of memory for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem / 10;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -2057,6 +2062,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2072,6 +2083,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We will replace it with a newer snapshot every so often so
+	 * that the xmin horizon can advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2105,6 +2139,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2162,6 +2197,20 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+#define VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE 4096
+		if (page_read_counter % VALIDATE_INDEX_RESET_SNAPSHOT_EACH_N_PAGE == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2171,11 +2220,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 6b7117b56b2..7ea60c18e6f 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2d7d25f1986..c37a786dafd 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3514,8 +3515,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  To
+ * propagate xmin, we replace that snapshot with a fresh one every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3528,7 +3530,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3541,21 +3543,24 @@ IndexCheckExclusion(Relation heapRelation,
  * before it declares a uniqueness error.
  *
  * After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index).  The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit.  Subsequent transactions
+ * will be able to use it for queries.
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3605,8 +3610,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3642,6 +3651,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3661,7 +3673,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3670,6 +3688,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3689,19 +3710,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; drop it */
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3724,6 +3750,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b430d4a5b34..0e7b961b170 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1814,32 +1813,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1861,8 +1839,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4427,7 +4405,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4442,13 +4419,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4460,16 +4430,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4482,7 +4444,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a997537800..2380a593d71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -701,12 +701,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1829,20 +1828,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6fa91bfcdc0..b33084cb91a 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 727993d1a5a..91666663834 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -158,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-23 22:08                                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-03-28 19:17                                                                     ` Mihail Nikalayeu <[email protected]>
  2026-03-31 22:11                                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-03-28 19:17 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Small fixes, comments, support for high isolation level, etc.


Attachments:

  [application/octet-stream] v32-0001-Add-stress-tests-for-concurrent-index-builds.patch (11.9K, 3-v32-0001-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 44293162407526557b77eb5a783d916a3648c474 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v32 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 273 ++++++++++++++++++++++++++++++++
 2 files changed, 274 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..47fc65b9dab
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,273 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test CREATE INDEX/REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my $node;
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+  $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+  $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+	'--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+	$pgbench_clients,
+	$pgbench_jobs,
+	$pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+		sprintf(
+		'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+		defined($ENV{PG_TEST_EXTRA})
+		? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+		: '(undef)',
+		$is_stress,
+		$pgbench_clients,
+		$pgbench_jobs,
+		$pgbench_transactions,
+		$max_sleep_ms,
+		$no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => sprintf(q(
+			SET debug_parallel_query = off; -- because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+		});
+
+$node->stop;
+done_testing();
-- 
2.43.0



  [application/octet-stream] v32-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (31.7K, 4-v32-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From 73d3dee9747f968b31c877c2e62ec6d4609671de Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v32 5/7] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
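
As a rough illustration of the dependency tracking listed above, here is a standalone C sketch that models the lookup outside of PostgreSQL. Everything in it (ToyDepend, toy_get_auxiliary_index, the OIDs and the access-method constants) is hypothetical and only mimics the shape of the pg_depend scan; it is not the backend code from this patch.

```c
#include <stddef.h>

/* Hypothetical stand-ins for backend types; not PostgreSQL definitions. */
typedef unsigned int ToyOid;
#define TOY_INVALID_OID 0u

typedef enum { TOY_DEP_NORMAL, TOY_DEP_AUTO } ToyDepType;
typedef enum { TOY_AM_BTREE, TOY_AM_STIR } ToyAm;

/* One simplified pg_depend-like row: "objid depends on refobjid". */
typedef struct
{
	ToyOid		objid;
	ToyOid		refobjid;
	ToyDepType	deptype;
	ToyAm		objam;		/* access method of the dependent index */
} ToyDepend;

/*
 * Return the auxiliary (STIR) index that has an AUTO dependency on
 * indexId, or TOY_INVALID_OID if there is none.
 */
static ToyOid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, ToyOid indexId)
{
	for (size_t i = 0; i < ndeps; i++)
	{
		if (deps[i].refobjid == indexId &&
			deps[i].deptype == TOY_DEP_AUTO &&
			deps[i].objam == TOY_AM_STIR)
			return deps[i].objid;	/* assumed: at most one per index */
	}
	return TOY_INVALID_OID;
}

/* Sample toy catalog: index 200 (STIR) is auxiliary for index 100. */
static const ToyDepend toy_deps[] = {
	{200, 100, TOY_DEP_AUTO, TOY_AM_STIR},
	{300, 100, TOY_DEP_NORMAL, TOY_AM_BTREE},
};
```

In this toy catalog, toy_get_auxiliary_index(toy_deps, 2, 100) yields 200, while asking about index 200 itself yields TOY_INVALID_OID because nothing depends on it.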
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  78 +++++++++++----
 src/backend/catalog/pg_depend.c            |  62 ++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 380 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..406c02e866e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 9e0248261ae..ac9cfec5c55 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7f29dfa0b28..5bf7fe131c0 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId must be valid if and only if INDEX_CREATE_AUXILIARY is set */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1584,7 +1596,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2623,7 +2636,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2684,7 +2698,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3763,8 +3778,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			indexForm->indisvalid = true;
 			break;
 		case INDEX_DROP_CLEAR_READY:
-			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
-			Assert(indexForm->indisready);
+			/*
+			 * Clear indisready during a CREATE INDEX CONCURRENTLY sequence.
+			 * indisready may already be false if the CIC failed before
+			 * index_concurrently_build had a chance to set it.
+			 */
 			Assert(!indexForm->indisvalid);
 			indexForm->indisready = false;
 			break;
@@ -3849,6 +3867,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3905,6 +3924,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Look for an auxiliary index of this index; if one exists, drop it */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4193,7 +4225,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4282,13 +4315,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4314,18 +4364,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..deacd2f7c95 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,67 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * Look for a STIR index with an AUTO dependency on this index.  There
+		 * can be at most one STIR auxiliary per index, so stop at the first
+		 * match.  Transitive auxiliaries (e.g. ccnew_ccaux from a failed
+		 * REINDEX CONCURRENTLY) are found by calling this with the ccnew OID;
+		 * they are also cleaned up automatically through the cascading AUTO
+		 * dependency when the intermediate index is dropped.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index c33e43df1ec..b16eac0357f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index cb07e1ae389..de603d3ff83 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3711,6 +3712,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4060,6 +4062,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4067,6 +4070,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4140,12 +4144,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look up any junk auxiliary index of the reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4155,6 +4164,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4176,10 +4186,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4368,7 +4386,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4391,6 +4410,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4610,6 +4632,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4661,6 +4685,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c69c12dc014..df29a7021b7 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * Concurrent index drop must run as the first transaction. If there
+		 * is a junk auxiliary index, we want to drop it too (also
+		 * concurrently): silently perform an internal deletion of the
+		 * auxiliary index, then restore the transaction state. Doing this in
+		 * the loop is fine because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 74efa237212..136dddbbf11 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -229,6 +229,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v32-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (98.0K, 5-v32-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From e0ea26d0562821d2ab8090c26573da675b19b2f5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v32 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into tuplesorts
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
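
As a rough illustration only (not code from this patch), the two-pointer step of the merge can be sketched in standalone C over plain sorted arrays. The names tid_encode and tid_difference are hypothetical; the patch itself does this over Tuplesortstate and Tuplestorestate in heapam_index_validate_tuplesort_difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for TID encoding: pack a block number and an offset
 * into one integer so that integer order matches TID order.
 */
static uint64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((uint64_t) block << 16) | offset;
}

/*
 * Write to out[] every value present in aux[] but absent from main_tids[].
 * Both inputs must be sorted ascending; out[] needs room for naux values.
 * Returns the number of values written.
 */
static size_t
tid_difference(const uint64_t *main_tids, size_t nmain,
			   const uint64_t *aux, size_t naux,
			   uint64_t *out)
{
	size_t		m = 0;		/* cursor into main_tids[] */
	size_t		nout = 0;

	for (size_t a = 0; a < naux; a++)
	{
		/* Debug check: the auxiliary input really is sorted. */
		assert(a == 0 || aux[a - 1] <= aux[a]);

		/* Advance the main cursor until it reaches or passes aux[a]. */
		while (m < nmain && main_tids[m] < aux[a])
			m++;

		/* Main is exhausted or already past aux[a]: the TID is missing. */
		if (m == nmain || main_tids[m] > aux[a])
			out[nout++] = aux[a];
	}
	return nout;
}
```

Each input is read once, so the cost is linear in the number of TIDs collected from the two indexes rather than in the size of the table.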
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  40 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 561 ++++++++++++++-------
 src/backend/catalog/index.c                | 322 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/backend/utils/misc/guc_parameters.dat  |   9 +
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/miscadmin.h                    |   1 +
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1155 insertions(+), 336 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bb75ed1069b..835b4aeed77 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6787,6 +6787,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, so that all subsequent transactions
+       insert new tuples into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6827,13 +6839,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6850,8 +6861,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..9e0248261ae 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index tracks all tuples
+       inserted during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way. It is marked as
+"ready for inserts" without any actual table scan. Its purpose is to collect
+new tuples inserted into the table while our target index is still "not
+ready for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d40878928e1..194ac75caa5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -42,15 +42,20 @@
 #include "storage/lmgr.h"
 #include "storage/lock.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
+
+/* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
+int			debug_cic_validate_store_mem_pct = 10;
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
-									 Relation OldHeap, Relation NewHeap,
-									 Datum *values, bool *isnull, RewriteState rwstate);
+                                     Relation OldHeap, Relation NewHeap,
+                                     Datum *values, bool *isnull, RewriteState rwstate);
 
 static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
@@ -1769,242 +1774,422 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation, store = aux \ main (also written aux - main).
+ *
+ * returns number of items added to store
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+										   Tuplesortstate *aux,
+										   Tuplestorestate *store)
+{
+	int64		num = 0;
+	/* state variables for the merge */
+	ItemPointer	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds the
+	 * TIDs that must be compared against those from the main sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation, which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int64			num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber *tuples;
+	ReadStream *read_stream;
+
+	/* Use a percentage of maintenance_work_mem for tuple store. */
+	int		store_work_mem_part = (int) ((int64) maintenance_work_mem * debug_cic_validate_store_mem_pct / 100);
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end both tuplesorts as soon as possible */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
+		 * It is safe to access tuple data after releasing the buffer lock
+		 * because the buffer pin is still held, and the only operation that
+		 * could physically move tuple data on the page is
+		 * PageRepairFragmentation via heap_page_prune.  VACUUM conflicts with
+		 * CIC (both take ShareUpdateExclusiveLock), and opportunistic pruning
+		 * from concurrent DML cannot affect root tuples we are referencing.
 		 */
-		if (hscan->rs_cblock != root_blkno)
-		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
+		 * No predicate evaluation is needed here: the auxiliary STIR index
+		 * only contains TIDs for tuples that already satisfied the partial
+		 * index predicate at DML time (checked in ExecInsertIndexTuples).
 		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
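The validation scan above relies on two ideas: TIDs are encoded as int8 values so that they sort cheaply, and the sorted TID list of the auxiliary index is compared with the sorted TID list of the target index to find entries present only in the auxiliary one. What follows is a minimal sketch of that idea in plain C, not the patch's implementation. The names SketchTid, tid_encode, tid_decode and sorted_difference are hypothetical, plain arrays stand in for tuplesort output, the packing (block number in the high bits, 16-bit offset in the low bits) is an assumption of this sketch, and so is the choice to emit duplicate auxiliary TIDs only once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a heap TID: block number plus line offset. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} SketchTid;

/* Pack a TID into one int64 so integer order equals (block, offset) order. */
static int64_t
tid_encode(SketchTid tid)
{
    return (int64_t) (((uint64_t) tid.block << 16) | (uint64_t) tid.offset);
}

/* Inverse of tid_encode. */
static SketchTid
tid_decode(int64_t encoded)
{
    SketchTid tid;

    tid.block = (uint32_t) ((uint64_t) encoded >> 16);
    tid.offset = (uint16_t) ((uint64_t) encoded & 0xffff);
    return tid;
}

/*
 * Copy into "out" every value of the sorted array "aux" that is absent from
 * the sorted array "target", and return how many values were copied.  "out"
 * must have room for naux entries.  Duplicates in "aux" are emitted once
 * (an assumption of this sketch).
 */
static size_t
sorted_difference(const int64_t *aux, size_t naux,
                  const int64_t *target, size_t ntarget,
                  int64_t *out)
{
    size_t i;
    size_t j = 0;
    size_t nout = 0;

    assert(aux != NULL || naux == 0);
    assert(target != NULL || ntarget == 0);

    for (i = 0; i < naux; i++)
    {
        /* Skip repeated auxiliary entries. */
        if (i > 0 && aux[i] == aux[i - 1])
            continue;

        /* Advance the target cursor until it reaches or passes aux[i]. */
        while (j < ntarget && target[j] < aux[i])
            j++;

        /* No match in the target list: this TID still has to be indexed. */
        if (j == ntarget || target[j] != aux[i])
            out[nout++] = aux[i];
    }
    return nout;
}
```

Because both inputs are consumed in a single forward pass, the comparison costs O(naux + ntarget) once the sorts are done, which is what allows the inputs to be released before any heap page is read.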
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 2fc86ca9c5b..7f29dfa0b28 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it equals the persistence level of the table; auxiliary indexes
+ *		are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,20 +1412,24 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
 	 * index information.  All this information will be used for the index
 	 * creation.
 	 */
-	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
 	{
 		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
-		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
 
-		indexColNames = lappend(indexColNames, NameStr(att->attname));
-		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
 	}
 
 	/* Extract opclass options for each attribute */
@@ -1473,6 +1491,157 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into the catalogs and must
+ * be built later on.  Called during concurrent index build and reindex.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Building an auxiliary index for an index with exclusion constraints is
+	 * not supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Look up the pg_index tuple of the main index */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2453,7 +2622,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2513,7 +2683,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3289,12 +3460,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation that
+ * does not scan the table, and it leaves us with an empty STIR index.  We
+ * commit the transaction and again wait for all transactions that could have
+ * been modifying the table to terminate.  From that moment on, all new tuples
+ * are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3304,14 +3484,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3319,12 +3502,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the two TID lists to see which tuples from the auxiliary index are missing
+ * from the target index.  Thus we will ensure that all tuples valid according
+ * to the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3342,22 +3527,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * In addition, the steps needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% is used for the tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3390,6 +3579,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3414,15 +3604,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If the aux index is empty, the merge can be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3445,27 +3669,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* The tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3474,6 +3701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3534,6 +3762,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3805,6 +4039,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4047,6 +4288,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4072,6 +4314,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e54018004db..08634c43ea6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1388,16 +1388,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
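The renumbering in the view above shifts every phase after "waiting for writers before build" up by one to make room for the new auxiliary-index wait, and renames the table-scan phase to an index-merge phase. As an illustration only, the CASE expression can be mirrored by a small C lookup table. The helper create_index_phase_name is hypothetical and not part of the patch, and the AM-specific suffix that the view appends to "building index" via pg_indexam_progress_phasename is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Phase names as listed in the updated pg_stat_progress_create_index view. */
static const char *const phase_names[] = {
    "initializing",                                 /* 0 */
    "waiting for writers before build",             /* 1 */
    "waiting for writers to use auxiliary index",   /* 2, new in this patch */
    "building index",                               /* 3 */
    "waiting for writers before validation",        /* 4 */
    "index validation: scanning index",             /* 5 */
    "index validation: sorting tuples",             /* 6 */
    "index validation: merging indexes",            /* 7, was "scanning table" */
    "waiting for old snapshots",                    /* 8 */
    "waiting for readers before marking dead",      /* 9 */
    "waiting for readers before dropping"           /* 10 */
};

/*
 * Hypothetical helper: map a phase number to its name.  Unknown values give
 * NULL, matching a SQL CASE expression without an ELSE branch.
 */
static const char *
create_index_phase_name(long phase)
{
    size_t nphases = sizeof(phase_names) / sizeof(phase_names[0]);

    if (phase < 0 || (size_t) phase >= nphases)
        return NULL;
    return phase_names[phase];
}
```

Monitoring scripts that match on the numeric phase rather than on the text would need the same shift by one for every phase from "building index" onward.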
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index dd593ccbc1c..cb07e1ae389 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1603,6 +1619,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * In case of concurrent build - create auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1631,11 +1657,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1645,7 +1671,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1684,7 +1710,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1696,14 +1722,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but indexes are still
+	 * marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index. It is created empty, without any
+	 * actual heap scan, but marked as "ready-for-inserts". The goal of that
+	 * index is to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in table are inserted into
+	 * the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1738,9 +1794,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1758,24 +1833,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1802,7 +1867,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1827,6 +1892,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3598,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3703,8 +3816,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3756,8 +3876,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3818,6 +3945,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3921,15 +4055,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3980,6 +4117,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3993,12 +4135,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4007,6 +4154,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4025,10 +4173,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4109,13 +4261,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast because it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4162,6 +4361,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts". So,
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4169,12 +4403,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4212,7 +4440,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4241,7 +4469,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4332,14 +4560,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4364,6 +4592,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4377,11 +4627,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4401,6 +4651,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0a862693fcd..a80ee4fb03f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -631,6 +631,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_store_mem_pct',
+  boot_val => '10',
+  min => '1',
+  max => '90',
+},
+
 { name => 'debug_copy_parse_plan_trees', type => 'bool', context => 'PGC_SUSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Set this to force all parse and plan trees to be passed through copyObject(), to facilitate catching errors and omissions in copyObject().',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..1a997537800 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -705,7 +705,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1824,19 +1825,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 36b70689254..727993d1a5a 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -71,6 +72,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -151,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9c40772706c..8e5f98c6fad 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -117,14 +117,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..f4f4aa19963 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -268,6 +268,7 @@ extern PGDLLIMPORT bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index f50868ca6a6..b34009f868c 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2b3cf6d8569..b01fa1e61e3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2064,14 +2064,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
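
For reference, the renumbered build phases from the rules.out hunk above can be read as a simple lookup table. The following is a minimal standalone sketch, not PostgreSQL source code: the phase strings are copied from the updated pg_stat_progress_create_index view definition, while the array `phase_names` and the helper `phase_name` are hypothetical names introduced only for illustration. The real view additionally appends an access-method-specific sub-phase to 'building index', which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Phase names as they appear in the updated rules.out expected output.
 * The array index corresponds to the value of s.param10 in the view.
 * Illustration only; this table does not exist in PostgreSQL.
 */
static const char *const phase_names[] = {
	"initializing",								/* 0 */
	"waiting for writers before build",			/* 1 */
	"waiting for writers to use auxiliary index",	/* 2 (new) */
	"building index",							/* 3 (was 2) */
	"waiting for writers before validation",	/* 4 (was 3) */
	"index validation: scanning index",			/* 5 (was 4) */
	"index validation: sorting tuples",			/* 6 (was 5) */
	"index validation: merging indexes",		/* 7 (replaces 'scanning table') */
	"waiting for old snapshots",				/* 8 (was 7) */
	"waiting for readers before marking dead",	/* 9 (was 8) */
	"waiting for readers before dropping",		/* 10 (was 9) */
};

/* Hypothetical helper: returns NULL for unknown phases, like ELSE NULL. */
static const char *
phase_name(int phase)
{
	size_t		nphases = sizeof(phase_names) / sizeof(phase_names[0]);

	if (phase < 0 || (size_t) phase >= nphases)
		return NULL;
	return phase_names[phase];
}
```

Any monitoring code that matches on the numeric phase rather than on the phase text would presumably need to account for this shift by one from phase 2 onwards.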



  [application/octet-stream] v32-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 6-v32-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
  download | inline diff:
From 69264c56ed0ecdac67080215896033dd4767df25 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v32 3/7] Add Datum storage support to tuplestore Extend
 tuplestore to store individual Datum values

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 367 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 327 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index caad7cad0b4..132ecf22088 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 #include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typlen of that type */
+	bool		datumTypeByVal; /* typbyval of that type */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to read the length of the next tuple from tape.  The result is
+	 * passed as the 'len' argument of readtup (see above).
+	 */
+	unsigned int(*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -207,10 +218,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For a by-value Datum both length words are omitted (a marker byte is used).
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as in the case of a by-value Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For a by-value datum, zero may be a valid value. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_gettupleslot_force - exported function to fetch a tuple
  *
@@ -1205,10 +1319,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
 			pfree(tuple);
 		return true;
 	}
-	else
+
+	/*
+	 * A NULL return normally means end-of-data, but for by-value datum
+	 * stores, a valid zero-valued datum (e.g., false, 0) is indistinguishable
+	 * from NULL via pointer check.  Use eof_reached to distinguish.
+	 */
+	if (state->datumTypeByVal)
 	{
-		return false;
+		TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+		return !readptr->eof_reached;
 	}
+
+	return false;
 }
 
 /*
@@ -1271,7 +1395,13 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
 				tuple = tuplestore_gettuple(state, forward, &should_free);
 
 				if (tuple == NULL)
-					return false;
+				{
+					/* See tuplestore_advance for why pointer check is insufficient */
+					if (!state->datumTypeByVal ||
+						state->readptrs[state->activeptr].eof_reached)
+						return false;
+					continue;
+				}
 				if (should_free)
 					pfree(tuple);
 				CHECK_FOR_INTERRUPTS();
@@ -1493,8 +1623,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 	}
 	state->memtupdeleted = nremove;
@@ -1589,25 +1722,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1618,6 +1732,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1664,3 +1791,127 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both by-reference and by-value Datums:
+ * - By-reference Datums (fixed or variable length) get a length prefix (and suffix if backward scan)
+ * - By-value Datums are stored inline without copying, preceded by a single marker byte
+ *   XXX: consider refactoring to avoid it; currently needed for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	if (datum == NULL)
+		return NULL;
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8 junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* marker byte, used only to detect the end of data for rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size;
+		unsigned int tuplen;
+
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		else
+			size = state->datumTypeLen;
+
+		/*
+		 * Include sizeof(unsigned int) in the stored length, matching the
+		 * convention used by writetup_heap.  The backward-scan seek
+		 * arithmetic in tuplestore_gettuple assumes this.
+		 */
+		tuplen = size + sizeof(unsigned int);
+		BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = 0;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		unsigned int datalen = len - sizeof(unsigned int);
+		void *data = palloc(datalen);
+
+		BufFileReadExact(state->myfile, data, datalen);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index f638b96e156..e16d9a3d352 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
 										  bool copy, TupleTableSlot *slot);
-- 
2.43.0
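
The by-value path in the patch above stores a marker byte ahead of each value, so that end of data is detected from the tape rather than from the returned value, which may legitimately be zero. The following is a minimal standalone sketch of that idea, assuming a toy in-memory buffer in place of BufFile. `ToyTape`, `toy_put_int32` and `toy_get_int32` are hypothetical names and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy in-memory "tape"; stands in for BufFile in this sketch only. */
typedef struct ToyTape
{
	uint8_t		buf[256];
	size_t		len;			/* number of bytes written so far */
	size_t		pos;			/* current read position */
	bool		eof_reached;	/* set once a read hits end of data */
} ToyTape;

/* Write one by-value item: a marker byte, then the value itself. */
static void
toy_put_int32(ToyTape *tape, int32_t value)
{
	uint8_t		marker = (uint8_t) sizeof(value);

	assert(tape->len + 1 + sizeof(value) <= sizeof(tape->buf));
	tape->buf[tape->len++] = marker;
	memcpy(tape->buf + tape->len, &value, sizeof(value));
	tape->len += sizeof(value);
}

/*
 * Read the next item.  Returns false only at end of data; a stored zero
 * is reported as a valid value with a true result.
 */
static bool
toy_get_int32(ToyTape *tape, int32_t *result)
{
	if (tape->pos >= tape->len)
	{
		/* No marker byte left to read: this is the real end of data. */
		tape->eof_reached = true;
		*result = 0;
		return false;
	}

	/* The marker byte carries the value length, as a sanity check. */
	assert(tape->buf[tape->pos] == (uint8_t) sizeof(*result));
	tape->pos++;

	memcpy(result, tape->buf + tape->pos, sizeof(*result));
	tape->pos += sizeof(*result);
	return true;
}
```

With this layout a caller can store 0 and 42, read both back as valid values, and only see a false result on the third read. That is the same distinction tuplestore_getdatum draws by checking eof_reached instead of testing the returned value against NULL.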



  [application/octet-stream] v32-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.6K, 7-v32-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 85db69b74e79ca165ae7db874a47ddd054ee77e6 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v32 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR functions are designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

This infrastructure is planned to be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 567 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 110 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 765 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f698c2d899b..339dfb21df7 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3012,6 +3012,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3063,6 +3064,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..932590d9ccb
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+		/* Re-check after acquiring exclusive lock */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/*
+	 * For normal VACUUM, mark the index to skip inserts and warn that it
+	 * needs to be dropped.  In practice this path is not reachable during
+	 * CREATE INDEX CONCURRENTLY because the table-level locks held by CIC
+	 * prevent a concurrent VACUUM from opening the auxiliary index.  It can
+	 * only be reached if a leftover STIR index somehow survives a failed CIC
+	 * and a later VACUUM encounters it.
+	 */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages.  We don't care about concurrently added pages:
+	 * by this point the index is marked as not ready and no longer receives
+	 * inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * As with stirbulkdelete, this is not reachable during a normal CIC due to
+ * table-level locking.  It serves as a safety net for leftover STIR indexes
+ * from failed concurrent index builds.
+ */
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d8219b18c48..2fc86ca9c5b 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3412,6 +3412,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 078a1cf5127..c33e43df1ec 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -313,6 +313,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..1fbe70d187c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -726,6 +726,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..1671c3c2196 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -896,6 +896,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 1a27bf060b3..0356901ee10 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index c228147420a..1c7a4d17557 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..b08cf4d4ef0
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "nodes/pathnodes.h"
+#include "storage/bufpage.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0118e970dda..9649995f812 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 684e398f824..74efa237212 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -227,7 +227,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 6ff4d7ee901..9259679eea2 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2129,9 +2129,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v32-0007-Refresh-snapshot-periodically-during-index-valid.patch (27.0K, 8-v32-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 9cd374760b80ff1699c3dd16882ffb2263ac81a5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v32 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

Validation now takes a fresh snapshot every debug_cic_validate_snapshot_pages heap pages (4096 by default), allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT         |  4 +-
 src/backend/access/heap/heapam_handler.c   | 77 +++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c      | 12 +++-
 src/backend/catalog/index.c                | 73 +++++++++++++++-----
 src/backend/commands/indexcmds.c           | 50 ++------------
 src/backend/utils/misc/guc_parameters.dat  |  9 +++
 src/include/access/tableam.h               | 25 ++++---
 src/include/access/transam.h               | 15 +++++
 src/include/catalog/index.h                |  2 +-
 src/include/miscadmin.h                    |  1 +
 src/test/regress/expected/create_index.out |  3 +
 src/test/regress/sql/create_index.sql      |  4 ++
 12 files changed, 192 insertions(+), 83 deletions(-)
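
Editor's note (not part of the patch): the refresh cadence and the limitXmin bookkeeping described in the commit message can be modeled in a few lines of standalone C. This is only a sketch under stated assumptions: xid_follows() and xid_newer() are simplified models of TransactionIdFollows() and the TransactionIdNewer() helper added by this patch, and fake_snapshot_xmin() and simulate_validation() are hypothetical stand-ins for taking a real snapshot and for the heap scan loop. No PostgreSQL headers are used.

```c
#include <stdint.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId     ((TransactionId) 0)
#define FirstNormalTransactionId ((TransactionId) 3)

/* Wraparound-aware "a is newer than b", modeled on TransactionIdFollows(). */
static int
xid_follows(TransactionId a, TransactionId b)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (a < FirstNormalTransactionId || b < FirstNormalTransactionId)
		return a > b;
	return (int32_t) (a - b) > 0;
}

/* Newer of two XIDs; an invalid XID always loses (mirrors the patch helper). */
static TransactionId
xid_newer(TransactionId a, TransactionId b)
{
	if (a == InvalidTransactionId)
		return b;
	if (b == InvalidTransactionId)
		return a;
	return xid_follows(a, b) ? a : b;
}

/* Hypothetical: pretend the xmin horizon advances by 10 per page read. */
static TransactionId
fake_snapshot_xmin(int64_t page_read_counter)
{
	return (TransactionId) (1000 + page_read_counter * 10);
}

/*
 * Hypothetical model of the scan loop: the counter starts at 1 so that no
 * refresh happens before the first page, and a refresh is done whenever the
 * counter is a multiple of refresh_every (0 disables refreshing).
 * Returns the limitXmin the caller would later wait on.
 */
static TransactionId
simulate_validation(int64_t total_pages, int refresh_every, int *refreshes)
{
	int64_t		page_read_counter = 1;
	TransactionId limitXmin = fake_snapshot_xmin(page_read_counter);

	*refreshes = 0;
	for (int64_t page = 0; page < total_pages; page++)
	{
		page_read_counter++;
		/* ... tuples on this page would be checked here ... */
		if (refresh_every > 0 && page_read_counter % refresh_every == 0)
		{
			/* Drop the old snapshot, take a new one, advance limitXmin. */
			limitXmin = xid_newer(limitXmin,
								  fake_snapshot_xmin(page_read_counter));
			(*refreshes)++;
		}
	}
	return limitXmin;
}
```

With 10 pages and a refresh interval of 4, this model refreshes twice (at counter values 4 and 8). With an interval of 0, limitXmin stays at the xmin of the first snapshot, which corresponds to the old single-reference-snapshot behavior.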

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 194ac75caa5..5f5431ba389 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -53,6 +53,9 @@
 /* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
 int			debug_cic_validate_store_mem_pct = 10;
 
+/* GUC: refresh snapshot every N pages during CIC validation (0 = disable) */
+int			debug_cic_validate_snapshot_pages = 4096;
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
                                      Relation OldHeap, Relation NewHeap,
                                      Datum *values, bool *isnull, RewriteState rwstate);
@@ -2026,24 +2029,35 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int64			num_to_check;
+	int64			page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
+
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
 	ValidateIndexScanState callback_private_data;
 
 	Buffer buf;
@@ -2053,6 +2067,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use a percentage of maintenance_work_mem for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -2061,6 +2077,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!reset_snapshot || !HaveRegisteredOrActiveSnapshot());
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2076,6 +2098,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it by newer snapshot every so often
+	 * to propagate horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid, for that reason limitXmin is supported.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2109,6 +2154,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2179,6 +2225,21 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+		if (reset_snapshot &&
+			debug_cic_validate_snapshot_pages > 0 &&
+			page_read_counter % debug_cic_validate_snapshot_pages == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* Advance limitXmin so we wait for all snapshots seen so far */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2188,11 +2249,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index c461f8dc02d..ef192fb99c2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8e4dfe88aa..5f8779426c7 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3518,8 +3519,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3532,7 +3534,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3545,21 +3547,24 @@ IndexCheckExclusion(Relation heapRelation,
  * before it declares a uniqueness error.
  *
  * After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index).  The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit.  Subsequent transactions
+ * will be able to use it for queries.
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3572,6 +3577,16 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
 	int			aux_work_mem_part = maintenance_work_mem / 10;
 
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
+#endif
+
 	{
 		const int	progress_index[] = {
 			PROGRESS_CREATEIDX_PHASE,
@@ -3609,8 +3624,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3646,6 +3665,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3665,7 +3687,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3674,6 +3702,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3693,19 +3724,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3728,6 +3764,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index de603d3ff83..bf8a1dbc35d 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1816,32 +1815,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1863,8 +1841,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot during validating was taken, we have to wait
+	 * out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4429,7 +4407,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4444,13 +4421,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4462,16 +4432,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4484,7 +4446,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a80ee4fb03f..be29cf3ba5a 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -631,6 +631,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_snapshot_pages', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Refresh snapshot every N pages during CIC validation (0 to disable).',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_snapshot_pages',
+  boot_val => '4096',
+  min => '0',
+  max => '1000000',
+},
+
 { name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 1a997537800..2380a593d71 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -701,12 +701,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1829,20 +1828,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6fa91bfcdc0..b33084cb91a 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 727993d1a5a..91666663834 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -158,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f4f4aa19963..2af08b66d43 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -269,6 +269,7 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
+extern PGDLLIMPORT int debug_cic_validate_snapshot_pages;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 2d6abb15a89..758c5884ff5 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3382,6 +3382,9 @@ DROP INDEX aux_index_ind6;
 --------+---------+-----------+----------+---------
  c1     | integer |           |          | 
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fd96d80abbc..65dd58b947d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1400,6 +1400,10 @@ DROP INDEX aux_index_ind6;
 -- Make sure auxiliary index dropped too
 \d aux_index_tab5
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v32-0006-Optimize-auxiliary-index-handling.patch (3.0K, 9-v32-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 1ceedab1465ccb7c981667c210d35a2e14615c77 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v32 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 9 +++++++++
 src/backend/executor/execIndexing.c | 5 ++++-
 src/include/nodes/execnodes.h       | 6 ++++--
 3 files changed, 17 insertions(+), 3 deletions(-)
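
Editor's note (not part of the patch): the two shortcuts listed in the commit message can be illustrated with a standalone C sketch. Everything here is an assumption made for illustration: SketchIndexInfo, sketch_form_index_datum() and sketch_index_unchanged() are hypothetical simplifications of IndexInfo, FormIndexDatum() and the indexUnchanged computation in ExecInsertIndexTuples(), and Datum is modeled as a plain integer type.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t Datum;		/* simplified model of PostgreSQL's Datum */

/* Hypothetical, cut-down stand-in for IndexInfo. */
typedef struct SketchIndexInfo
{
	int			num_attrs;		/* models ii_NumIndexAttrs */
	bool		auxiliary;		/* models ii_Auxiliary */
} SketchIndexInfo;

/*
 * Fill values[]/isnull[] for one index entry.  An auxiliary index stores
 * only TIDs, so every column is reported as NULL and the source columns
 * are never evaluated.
 */
static void
sketch_form_index_datum(const SketchIndexInfo *info, const Datum *source,
						Datum *values, bool *isnull)
{
	if (info->auxiliary)
	{
		memset(values, 0, sizeof(Datum) * info->num_attrs);
		memset(isnull, true, sizeof(bool) * info->num_attrs);
		return;
	}

	/* Regular index: copy the already computed column values. */
	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = source[i];
		isnull[i] = false;
	}
}

/*
 * The "index unchanged" hint is only meaningful for an UPDATE on a regular
 * index; for an auxiliary index the comparison is skipped and the hint is
 * always false.
 */
static bool
sketch_index_unchanged(bool is_update, const SketchIndexInfo *info,
					   bool values_equal)
{
	return is_update && !info->auxiliary && values_equal;
}
```

In this model the auxiliary branch returns before touching the source columns, which is where the saving comes from when the real index has expensive expressions.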

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 5bf7fe131c0..b8e4dfe88aa 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2923,6 +2923,15 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		Assert(indexInfo->ii_Am == STIR_AM_OID);
+		memset(values, 0, sizeof(Datum) * indexInfo->ii_NumIndexAttrs);
+		memset(isnull, true, sizeof(bool) * indexInfo->ii_NumIndexAttrs);
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 9d071e495c6..b0e606460be 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && !indexInfo->ii_Auxiliary &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 136dddbbf11..69441685ddb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,10 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.  ii_Auxiliary
+ * is also used during retail inserts to skip datum formation for auxiliary
+ * indexes.
  * ----------------
  */
 typedef struct IndexInfo
-- 
2.43.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-23 22:08                                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-28 19:17                                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-03-31 22:11                                                                       ` Mihail Nikalayeu <[email protected]>
  2026-04-06 18:21                                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-03-31 22:11 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello!

Just rebased.


Attachments:

  [application/x-patch] v33-0001-Add-stress-tests-for-concurrent-index-builds.patch (11.9K, 2-v33-0001-Add-stress-tests-for-concurrent-index-builds.patch)
From 1377297603529fadec0727ee3fa7dc51853b440d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v33 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 273 ++++++++++++++++++++++++++++++++
 2 files changed, 274 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..47fc65b9dab
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,273 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my $node;
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+  $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+  $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+	'--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+	$pgbench_clients,
+	$pgbench_jobs,
+	$pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+		sprintf(
+		'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+		defined($ENV{PG_TEST_EXTRA})
+		? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+		: '(undef)',
+		$is_stress,
+		$pgbench_clients,
+		$pgbench_jobs,
+		$pgbench_transactions,
+		$max_sleep_ms,
+		$no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC in different options concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => sprintf(q(
+			SET debug_parallel_query = off; -- required by the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					DROP INDEX CONCURRENTLY new_idx;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+		});
+
+$node->stop;
+done_testing();
-- 
2.53.0



  [application/x-patch] v33-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 3-v33-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
From 2d640837af25d4cdd0ab54d61c641ee2dd5a7c8d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v33 3/7] Add Datum storage support to tuplestore Extend
 tuplestore to store individual Datum values

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 367 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 327 insertions(+), 73 deletions(-)
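
One subtlety of storing by-value Datums is that a stored zero looks exactly
like the NULL pointer that normally signals end-of-data. The standalone
sketch below illustrates the approach the patch takes in
tuplestore_getdatum (consult eof_reached instead of the pointer). The
SketchDatumStore type and the sketch_* functions are hypothetical and only
mimic the shape of the real API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;		/* stand-in for PostgreSQL's Datum */

/* Hypothetical in-memory store of by-value datums with one read pointer. */
typedef struct SketchDatumStore
{
	const Datum *items;			/* stored values */
	size_t		count;			/* number of stored values */
	size_t		pos;			/* next value to return */
	bool		eof_reached;	/* set once a read runs past the end */
} SketchDatumStore;

/*
 * Low-level fetch in the style of a gettuple routine: the value comes
 * back as a pointer-sized word, so a stored 0 and "no more data" both
 * return NULL.
 */
static void *
sketch_gettuple(SketchDatumStore *store)
{
	if (store->pos >= store->count)
	{
		store->eof_reached = true;
		return NULL;
	}
	return (void *) store->items[store->pos++];
}

/*
 * Datum-level fetch: decide between "valid zero" and "end of data" by
 * looking at eof_reached rather than at the returned pointer.
 */
static bool
sketch_getdatum(SketchDatumStore *store, Datum *result)
{
	Datum		datum = (Datum) sketch_gettuple(store);

	if (store->eof_reached)
	{
		*result = 0;
		return false;
	}
	*result = datum;
	return true;
}
```

With the values 5, 0, 8 stored, three calls succeed (the second one
yielding 0) and only the fourth reports end-of-data.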

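The new lentup callback can be read as "how long is the next record on
tape?". The sketch below shows one way such a dispatch could look. For
tuples the length is read from a leading length word; for a constant-length
Datum there is no length word, so the length is taken from the type. This
is an assumption-laden illustration: SketchStore, SketchTape and the
sketch_lentup_* functions are hypothetical, and the patch's real
lentup_heap and lentup_datum may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "tape": a byte buffer with a read position. */
typedef struct SketchTape
{
	const unsigned char *buf;
	size_t		len;
	size_t		pos;
} SketchTape;

typedef struct SketchStore SketchStore;
struct SketchStore
{
	SketchTape	tape;
	unsigned int fixed_len;		/* record size for fixed-length datums */
	unsigned int (*lentup) (SketchStore *store);	/* length of next record */
};

/*
 * Tuple style: consume the leading length word, which counts itself.
 * Returns 0 when no complete length word is left.
 */
static unsigned int
sketch_lentup_heap(SketchStore *store)
{
	unsigned int len;

	if (store->tape.len - store->tape.pos < sizeof(len))
		return 0;
	memcpy(&len, store->tape.buf + store->tape.pos, sizeof(len));
	store->tape.pos += sizeof(len);
	return len;
}

/*
 * Fixed-length datum style: nothing is read from the tape, the length
 * is a constant.  Returns 0 when no complete record is left.
 */
static unsigned int
sketch_lentup_datum(SketchStore *store)
{
	assert(store->fixed_len > 0);

	if (store->tape.len - store->tape.pos < store->fixed_len)
		return 0;
	return store->fixed_len;
}
```

Callers stay unaware of the difference: they call store->lentup() and treat
0 as end-of-data in both layouts.
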
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index f9e2d95186a..2a9b25bd238 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 #include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the Datums to be stored */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int(*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -207,10 +218,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For a Datum with constant length, both "unsigned int" length words are omitted.
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as for a constant-size Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For a by-value datum, zero may be a valid value. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_gettupleslot_force - exported function to fetch a tuple
  *
@@ -1205,10 +1319,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
 			pfree(tuple);
 		return true;
 	}
-	else
+
+	/*
+	 * A NULL return normally means end-of-data, but for by-value datum
+	 * stores a valid zero-valued datum (e.g., false, 0) is indistinguishable
+	 * from NULL via pointer check.  Use eof_reached to distinguish.
+	 */
+	if (state->datumTypeByVal)
 	{
-		return false;
+		TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+		return !readptr->eof_reached;
 	}
+
+	return false;
 }
 
 /*
@@ -1271,7 +1395,13 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
 				tuple = tuplestore_gettuple(state, forward, &should_free);
 
 				if (tuple == NULL)
-					return false;
+				{
+					/* See tuplestore_advance for why pointer check is insufficient */
+					if (!state->datumTypeByVal ||
+						state->readptrs[state->activeptr].eof_reached)
+						return false;
+					continue;
+				}
 				if (should_free)
 					pfree(tuple);
 				CHECK_FOR_INTERRUPTS();
@@ -1505,8 +1635,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 		/* As in dumptuples(), increment memtupdeleted synchronously */
 		state->memtupdeleted++;
@@ -1603,25 +1736,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1632,6 +1746,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1678,3 +1805,127 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - Fixed-length and variable-length by-reference Datums include a length prefix (and suffix if backward scan)
+ * - By-value types are handled inline without extra copying, storing a single extra marker byte
+ *   XXX: consider refactoring to avoid it; currently needed for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	if (datum == NULL)
+		return NULL;
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8 junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* marker byte, used to detect the end of data in the rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size;
+		unsigned int tuplen;
+
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		else
+			size = state->datumTypeLen;
+
+		/*
+		 * Include sizeof(unsigned int) in the stored length, matching the
+		 * convention used by writetup_heap.  The backward-scan seek
+		 * arithmetic in tuplestore_gettuple assumes this.
+		 */
+		tuplen = size + sizeof(unsigned int);
+		BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = 0;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		unsigned int datalen = len - sizeof(unsigned int);
+		void *data = palloc(datalen);
+
+		BufFileReadExact(state->myfile, data, datalen);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index f638b96e156..e16d9a3d352 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc.  It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
 										  bool copy, TupleTableSlot *slot);
-- 
2.53.0
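
The end-of-data handling that this patch adds for by-value datum stores can be illustrated outside PostgreSQL. The following is a minimal standalone sketch, not code from the patch: `ToyStore`, `toy_gettuple` and `toy_getdatum` are hypothetical names, and the store is just an in-memory array. It shows why a NULL check on the returned pointer is not enough when zero is a legitimate stored value, and why the wrapper consults an `eof_reached` flag instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a by-value datum store; not PostgreSQL code. */
typedef struct ToyStore
{
	const uintptr_t *values;	/* stored by-value datums */
	size_t		nvalues;		/* number of stored datums */
	size_t		pos;			/* current read position */
	bool		eof_reached;	/* set once a read runs off the end */
} ToyStore;

/*
 * Low-level fetch: the value travels inside a pointer, so a stored zero
 * comes back as NULL, exactly like the end-of-data return.
 */
static void *
toy_gettuple(ToyStore *store)
{
	if (store->pos >= store->nvalues)
	{
		store->eof_reached = true;
		return NULL;
	}
	return (void *) store->values[store->pos++];
}

/* Wrapper: decide end-of-data from eof_reached, never from the pointer. */
static bool
toy_getdatum(ToyStore *store, uintptr_t *result)
{
	void	   *raw = toy_gettuple(store);

	if (store->eof_reached)
	{
		*result = 0;
		return false;
	}
	*result = (uintptr_t) raw;
	return true;
}
```

With the values 7, 0, 3 stored, three calls to `toy_getdatum` succeed (the second one yielding zero) and only the fourth reports end of data.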



  [application/x-patch] v33-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (98.0K, 4-v33-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 8ae3087b8d1d04b62b623145036dfb4a83197a80 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v33 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs are put into a tuplesort for each
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using the read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  40 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 561 ++++++++++++++-------
 src/backend/catalog/index.c                | 322 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 344 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/backend/utils/misc/guc_parameters.dat  |   9 +
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/miscadmin.h                    |   1 +
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1155 insertions(+), 336 deletions(-)
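
The two-pointer step in the commit message above (TIDs present in the auxiliary index but absent from the main one) can be sketched as a set difference over two sorted arrays. This is a simplified standalone illustration, not the patch's code: `sorted_difference` is a hypothetical name, and plain `int64_t` arrays stand in for the sorted tuplesorts of encoded TIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write to out[] every element of aux_tids[] that is absent from
 * main_tids[].  Both inputs must be sorted ascending; out[] needs room
 * for naux items.  Duplicates in aux_tids[] are emitted once per
 * occurrence.  Returns the number of items written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
				  const int64_t *aux_tids, size_t naux,
				  int64_t *out)
{
	size_t		i = 0;			/* cursor into main_tids */
	size_t		nout = 0;		/* number of items written to out */

	for (size_t j = 0; j < naux; j++)
	{
		/* Advance the main cursor until it reaches or passes aux_tids[j]. */
		while (i < nmain && main_tids[i] < aux_tids[j])
			i++;

		/* Main exhausted or already past: aux_tids[j] is missing from main. */
		if (i == nmain || main_tids[i] > aux_tids[j])
			out[nout++] = aux_tids[j];
	}
	return nout;
}
```

Because each cursor only moves forward, the pass is linear in the combined size of both inputs, which is what makes sorting the TIDs first worthwhile.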

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index bb75ed1069b..835b4aeed77 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6787,6 +6787,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6827,13 +6839,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6850,8 +6861,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> also waits before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..9e0248261ae 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones that are visible to the reference
+snapshot.  Again, the index entries must have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cdd153c6b6d..3a04453ff5d 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -42,15 +42,20 @@
 #include "storage/lmgr.h"
 #include "storage/lock.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
+
+/* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
+int			debug_cic_validate_store_mem_pct = 10;
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
-									 Relation OldHeap, Relation NewHeap,
-									 Datum *values, bool *isnull, RewriteState rwstate);
+                                     Relation OldHeap, Relation NewHeap,
+                                     Datum *values, bool *isnull, RewriteState rwstate);
 
 static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
@@ -1773,242 +1778,422 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation store = aux - main or store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+										   Tuplesortstate *aux,
+										   Tuplestorestate *store)
+{
+	int64		num = 0;
+	/* state variables for the merge */
+	ItemPointer	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which
+	 * holds TIDs that must be compared to those from the "main"
+	 * sort (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int64			num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber *tuples;
+	ReadStream *read_stream;
+
+	/* Use a percentage of maintenance_work_mem for tuple store. */
+	int		store_work_mem_part = (int) ((int64) maintenance_work_mem * debug_cic_validate_store_mem_pct / 100);
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
+		 * It is safe to access tuple data after releasing the buffer lock
+		 * because the buffer pin is still held, and the only operation that
+		 * could physically move tuple data on the page is
+		 * PageRepairFragmentation via heap_page_prune.  VACUUM conflicts with
+		 * CIC (both take ShareUpdateExclusiveLock), and opportunistic pruning
+		 * from concurrent DML cannot affect root tuples we are referencing.
 		 */
-		if (hscan->rs_cblock != root_blkno)
-		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
+		 * No predicate evaluation is needed here: the auxiliary STIR index
+		 * only contains TIDs for tuples that already satisfied the partial
+		 * index predicate at DML time (checked in ExecInsertIndexTuples).
 		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 0ceeda1fdd9..f6d0ac3f784 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,11 +715,16 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  * constr_flags: flags passed to index_constraint_create
  *		(only if INDEX_CREATE_ADD_CONSTRAINT is set)
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -760,6 +765,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -785,7 +791,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -793,6 +802,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1398,20 +1412,24 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							false,	/* not ready for inserts */
 							true,
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
 	 * index information.  All this information will be used for the index
 	 * creation.
 	 */
-	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
 	{
 		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
-		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
 
-		indexColNames = lappend(indexColNames, NameStr(att->attname));
-		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
 	}
 
 	/* Extract opclass options for each attribute */
@@ -1473,6 +1491,157 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2453,7 +2622,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2513,7 +2683,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3289,12 +3460,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation that
+ * does not scan the table, so the result is an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3304,14 +3484,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready" for
+ * the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3319,12 +3502,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls must
+ * be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3342,22 +3527,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, some actions to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3390,6 +3579,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3414,15 +3604,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3445,27 +3669,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Tuplesorts are closed by table_index_validate_scan() */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples,"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3474,6 +3701,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3534,6 +3762,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3805,6 +4039,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4047,6 +4288,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4072,6 +4314,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e54018004db..08634c43ea6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1388,16 +1388,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 373e8234794..e06353f3fde 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select name for auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1603,6 +1619,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, also create the auxiliary index.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1631,11 +1657,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1645,7 +1671,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1684,7 +1710,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1696,14 +1722,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index.  It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts".  Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that no running transaction still sees the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this point we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1738,9 +1794,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * Now the target index is marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1758,24 +1833,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the contents of the auxiliary and target indexes: insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1802,7 +1867,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1827,6 +1892,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3598,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3703,8 +3816,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3756,8 +3876,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3818,6 +3945,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3921,15 +4055,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3980,6 +4117,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3993,12 +4135,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4007,6 +4154,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4025,10 +4173,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4109,13 +4261,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast, as it creates an empty index without any heap scan. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4162,6 +4361,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4169,12 +4403,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4212,7 +4440,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4241,7 +4469,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4332,14 +4560,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4364,6 +4592,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4377,11 +4627,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4401,6 +4651,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0a862693fcd..a80ee4fb03f 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -631,6 +631,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_store_mem_pct',
+  boot_val => '10',
+  min => '1',
+  max => '90',
+},
+
 { name => 'debug_copy_parse_plan_trees', type => 'bool', context => 'PGC_SUSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Set this to force all parse and plan trees to be passed through copyObject(), to facilitate catching errors and omissions in copyObject().',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 57892152957..3705e21b588 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -731,7 +731,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1882,19 +1883,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index a38e95bc0eb..378701b19f1 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -71,6 +72,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -151,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 9c40772706c..8e5f98c6fad 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -117,14 +117,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 04f29748be7..eea3f818a86 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -268,6 +268,7 @@ extern PGDLLIMPORT bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index f50868ca6a6..b34009f868c 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2b3cf6d8569..b01fa1e61e3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2064,14 +2064,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.53.0
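
As a reading aid for the renumbered progress phases in the patch above, here is a minimal standalone C sketch of the phase-number-to-text mapping. The constants are copied from the progress.h hunk and the strings from the pg_stat_progress_create_index view in rules.out. The helper name createidx_phase_name() is hypothetical: it is not part of PostgreSQL or of the patch, and the real view additionally appends the access method's subphase name to "building index".

```c
#include <stddef.h>

/* Phase numbers as renumbered by the progress.h hunk above. */
#define PROGRESS_CREATEIDX_PHASE_WAIT_1				1
#define PROGRESS_CREATEIDX_PHASE_WAIT_2				2
#define PROGRESS_CREATEIDX_PHASE_BUILD				3
#define PROGRESS_CREATEIDX_PHASE_WAIT_3				4
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
#define PROGRESS_CREATEIDX_PHASE_WAIT_4				8
#define PROGRESS_CREATEIDX_PHASE_WAIT_5				9
#define PROGRESS_CREATEIDX_PHASE_WAIT_6				10

/*
 * Hypothetical helper (an assumption for illustration only): return the
 * phase text shown by the progress view, or NULL for an unknown phase.
 */
const char *
createidx_phase_name(int phase)
{
	switch (phase)
	{
		case 0:
			return "initializing";
		case PROGRESS_CREATEIDX_PHASE_WAIT_1:
			return "waiting for writers before build";
		case PROGRESS_CREATEIDX_PHASE_WAIT_2:
			return "waiting for writers to use auxiliary index";
		case PROGRESS_CREATEIDX_PHASE_BUILD:
			/* the real view also appends the AM-specific subphase name */
			return "building index";
		case PROGRESS_CREATEIDX_PHASE_WAIT_3:
			return "waiting for writers before validation";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN:
			return "index validation: scanning index";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT:
			return "index validation: sorting tuples";
		case PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE:
			return "index validation: merging indexes";
		case PROGRESS_CREATEIDX_PHASE_WAIT_4:
			return "waiting for old snapshots";
		case PROGRESS_CREATEIDX_PHASE_WAIT_5:
			return "waiting for readers before marking dead";
		case PROGRESS_CREATEIDX_PHASE_WAIT_6:
			return "waiting for readers before dropping";
		default:
			return NULL;
	}
}
```

Note that the old "index validation: scanning table" phase has no counterpart here: the patch replaces it with "index validation: merging indexes" and inserts a new wait phase before the build, which shifts every later number by one.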



  [application/x-patch] v33-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (31.7K, 5-v33-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From f063c0e1813519757a201a361d1b483719ca5a8e Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v33 5/7] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, an error can leave an auxiliary index behind as an orphaned object (a "junk" auxiliary index).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  78 +++++++++++----
 src/backend/catalog/pg_depend.c            |  62 ++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 380 insertions(+), 44 deletions(-)
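
The dependency tracking described in the commit message can be illustrated with a toy model. The sketch below is a standalone C approximation of the lookup that get_auxiliary_index() performs in this patch: among the objects that reference a given index, find one that is a STIR index with an AUTO dependency. All types and names here (DependRow, IndexRow, toy_get_auxiliary_index) are assumptions made for illustration; the real function scans pg_depend through the systable API.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Toy stand-ins for catalog values; not the real PostgreSQL definitions. */
typedef enum { DEP_NORMAL, DEP_AUTO } DepType;
typedef enum { AM_BTREE, AM_STIR } AccessMethod;

typedef struct DependRow
{
	Oid			objid;		/* dependent object */
	Oid			refobjid;	/* object it depends on */
	DepType		deptype;
} DependRow;

typedef struct IndexRow
{
	Oid			oid;
	AccessMethod am;
} IndexRow;

/* Return the access method of index "oid"; report via *found whether it exists. */
static AccessMethod
toy_index_am(const IndexRow *indexes, size_t nindexes, Oid oid, int *found)
{
	for (size_t i = 0; i < nindexes; i++)
	{
		if (indexes[i].oid == oid)
		{
			*found = 1;
			return indexes[i].am;
		}
	}
	*found = 0;
	return AM_BTREE;
}

/*
 * Toy analogue of get_auxiliary_index(): return the OID of the STIR index
 * that has an AUTO dependency on indexId, or InvalidOid if there is none.
 */
Oid
toy_get_auxiliary_index(const DependRow *deps, size_t ndeps,
						const IndexRow *indexes, size_t nindexes,
						Oid indexId)
{
	assert(deps != NULL || ndeps == 0);

	for (size_t i = 0; i < ndeps; i++)
	{
		int			found;
		AccessMethod am;

		if (deps[i].refobjid != indexId || deps[i].deptype != DEP_AUTO)
			continue;

		am = toy_index_am(indexes, nindexes, deps[i].objid, &found);
		if (found && am == AM_STIR)
			return deps[i].objid;	/* at most one auxiliary per index */
	}
	return InvalidOid;
}
```

Because the dependency is of type AUTO, dropping the main index takes the auxiliary index with it. REINDEX needs the explicit lookup instead, since it keeps the main index and removes only the junk auxiliary.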

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..406c02e866e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 9e0248261ae..ac9cfec5c55 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index f6d0ac3f784..aaf0b30ff9d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -776,6 +776,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1181,6 +1183,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1413,7 +1424,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							true,
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -1584,7 +1596,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2623,7 +2636,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2684,7 +2698,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3763,8 +3778,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			indexForm->indisvalid = true;
 			break;
 		case INDEX_DROP_CLEAR_READY:
-			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
-			Assert(indexForm->indisready);
+			/*
+			 * Clear indisready during a CREATE INDEX CONCURRENTLY sequence.
+			 * indisready may already be false if the CIC failed before
+			 * index_concurrently_build had a chance to set it.
+			 */
 			Assert(!indexForm->indisvalid);
 			indexForm->indisready = false;
 			break;
@@ -3849,6 +3867,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3905,6 +3924,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Look for a leftover auxiliary index of this index; if found, drop it */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4193,7 +4225,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4282,13 +4315,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4314,18 +4364,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..deacd2f7c95 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,67 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * Look for a STIR index with an AUTO dependency on this index.  There
+		 * is at most one STIR auxiliary per index, so stop at the first match.
+		 * Transitive auxiliaries (e.g. ccnew_ccaux from a failed REINDEX
+		 * CONCURRENTLY) are found by calling this with the ccnew OID, and
+		 * are also cleaned up automatically via cascading AUTO dependency
+		 * when the intermediate index is dropped.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index d7ea86b2805..f428dcdf10f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -315,6 +315,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e06353f3fde..0709e4f986b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3711,6 +3712,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4060,6 +4062,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4067,6 +4070,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4140,12 +4144,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4155,6 +4164,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4176,10 +4186,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4368,7 +4386,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start dropping the auxiliary indexes, including any junk
+	 * indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4391,6 +4410,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4610,6 +4632,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4661,6 +4685,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8b4ebc6f226..24171a1f165 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop must run in the first transaction of the
+		 * command. If the index has a junk auxiliary index, we want to drop
+		 * that too, also concurrently. So silently delete the auxiliary index
+		 * internally, commit, and restart the drop in a new transaction. Doing
+		 * this inside the loop is fine: drop->objects has a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to private context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 74efa237212..136dddbbf11 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -229,6 +229,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.53.0
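
For readers skimming the RemoveRelations() change above, here is a minimal sketch of the control flow only: if the index being dropped still has a leftover auxiliary index, that one is dropped first in its own step, and only then the main index. Everything named toy_* (and the ToyIndex struct) is hypothetical and exists only for this illustration; the patch itself uses get_auxiliary_index(), performDeletion(), a commit, and a recursive call rather than a loop over an in-memory array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy catalog entry; not a PostgreSQL structure. */
typedef struct ToyIndex
{
	int		oid;		/* toy identifier, 0 means "invalid" */
	int		aux_for;	/* oid of the main index if auxiliary, else 0 */
	bool	exists;
} ToyIndex;

/* Return the oid of a surviving auxiliary index of main_oid, or 0. */
static int
toy_get_auxiliary_index(const ToyIndex *cat, size_t n, int main_oid)
{
	for (size_t i = 0; i < n; i++)
		if (cat[i].exists && cat[i].aux_for == main_oid)
			return cat[i].oid;
	return 0;
}

/* Mark one index as dropped; returns false if it was not found. */
static bool
toy_drop_one(ToyIndex *cat, size_t n, int oid)
{
	for (size_t i = 0; i < n; i++)
		if (cat[i].exists && cat[i].oid == oid)
		{
			cat[i].exists = false;
			return true;
		}
	return false;
}

/*
 * Drop main_oid, first dropping any leftover auxiliary index in its own
 * step.  Returns the number of drop steps, or 0 if main_oid is missing.
 */
static int
toy_drop_index_concurrently(ToyIndex *cat, size_t n, int main_oid)
{
	int		steps = 0;
	int		aux;

	while ((aux = toy_get_auxiliary_index(cat, n, main_oid)) != 0)
	{
		toy_drop_one(cat, n, aux);	/* the patch commits after this step */
		steps++;
	}
	if (!toy_drop_one(cat, n, main_oid))
		return 0;
	return steps + 1;
}
```

The toy uses a loop where the patch restarts RemoveRelations() recursively after committing; the observable effect in this model is the same, since the second pass no longer finds an auxiliary index.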



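The next attachment adds the STIR access method, whose insert path appends heap TIDs to the last page and extends the index by one page when that page is full. As a rough mental model only, the append logic can be sketched in plain C as below. The Toy* types, the tiny page capacity, and the fixed page array are assumptions made for this illustration; the real code works on shared buffers with buffer locks, and stirinsert() always returns false, whereas the toy returns whether the TID was stored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_TUPLES 4		/* assumed tiny capacity, for illustration */
#define TOY_MAX_PAGES   8

typedef struct ToyTid
{
	uint32_t	block;
	uint16_t	offset;
} ToyTid;

typedef struct ToyPage
{
	uint16_t	maxoff;			/* number of tuples stored on the page */
	ToyTid		tuples[TOY_PAGE_TUPLES];
} ToyPage;

typedef struct ToyStir
{
	bool		skip_inserts;	/* mirrors the meta-page skipInserts flag */
	uint32_t	last_blk;		/* 0 means "no data page yet" */
	uint32_t	npages;			/* block 0 is reserved for the meta page */
	ToyPage		pages[TOY_MAX_PAGES];
} ToyStir;

static void
toy_stir_init(ToyStir *s)
{
	memset(s, 0, sizeof(*s));
	s->npages = 1;				/* only the meta page exists */
}

/* Append to the end of a page; false if the tuple does not fit. */
static bool
toy_page_add(ToyPage *p, ToyTid tid)
{
	if (p->maxoff >= TOY_PAGE_TUPLES)
		return false;
	p->tuples[p->maxoff++] = tid;
	return true;
}

/* Returns true if the TID was stored (unlike stirinsert, see lead-in). */
static bool
toy_stir_insert(ToyStir *s, ToyTid tid)
{
	uint32_t	blk;

	if (s->skip_inserts)
		return false;

	/* Try the current last page first. */
	if (s->last_blk > 0 && toy_page_add(&s->pages[s->last_blk], tid))
		return true;

	/* Last page is full or missing: extend by one page. */
	if (s->npages >= TOY_MAX_PAGES)
		return false;
	blk = s->npages++;
	s->pages[blk].maxoff = 0;
	toy_page_add(&s->pages[blk], tid);
	s->last_blk = blk;
	return true;
}
```

What the toy leaves out is the concurrency handling: the real insert path reads lastBlkNo under a share lock on the meta page, and re-checks it under an exclusive lock before extending, retrying if another backend extended the index first.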
  [application/x-patch] v33-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.6K, 6-v33-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 5fddbc07e84d02a73ed43d0c958ee32d65b0ed33 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v33 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It will be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 567 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 110 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 765 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..19cfdfd2640 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3012,6 +3012,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3063,6 +3064,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..932590d9ccb
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validate may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+		/* Re-check after acquiring exclusive lock */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/*
+	 * For normal VACUUM, mark the index to skip inserts and warn that it
+	 * must be dropped.  In practice this is not reachable during CREATE INDEX
+	 * CONCURRENTLY because the table-level locks held by CIC prevent concurrent
+	 * VACUUM from opening the auxiliary index.  It can only be reached if a
+	 * leftover STIR index somehow survives after a failed CIC and a later
+	 * VACUUM encounters it.
+	 */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because by this point the index is marked as not ready and no longer
+	 * receives inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * As with stirbulkdelete, this is not reachable during a normal CIC due to
+ * table-level locking.  It serves as a safety net for leftover STIR indexes
+ * from failed concurrent index builds.
+ */
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 1ccfa687f05..0ceeda1fdd9 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3412,6 +3412,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 4aa52a4bd25..d7ea86b2805 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..cbeb49050cd 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -726,6 +726,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..1671c3c2196 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -896,6 +896,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b69320a7fc8..1a5de4b691a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index e8cb7f7a627..7f3f08a70ac 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..b08cf4d4ef0
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "nodes/pathnodes.h"
+#include "storage/bufpage.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3579cec5744..242808d0402 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 684e398f824..74efa237212 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -227,7 +227,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index 6ff4d7ee901..9259679eea2 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2129,9 +2129,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.53.0
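
A note on the page layout in stir.h above: StirPageGetTuple and StirPageGetFreeSpace treat a data page as a dense array of fixed-size StirTuple entries placed right after the page header, with the opaque area at the end. The following is a minimal standalone sketch of that arithmetic, not code from the patch. The Model* names are hypothetical, and the constants (8 kB block size, 24-byte page header, 8-byte MAXALIGN, 6-byte TID) are assumed defaults rather than values stated in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed build defaults; none of these values come from the patch itself. */
#define MODEL_BLCKSZ		8192
#define MODEL_PAGE_HEADER	24
#define MODEL_MAXALIGN(x)	(((size_t) (x) + 7u) & ~(size_t) 7u)

/* Hypothetical stand-in for StirTuple: a bare 6-byte heap TID. */
typedef struct ModelTid
{
	uint16_t	bi_hi;
	uint16_t	bi_lo;
	uint16_t	posid;
} ModelTid;

/* Hypothetical stand-in for StirPageOpaqueData (three uint16 fields). */
typedef struct ModelOpaque
{
	uint16_t	maxoff;
	uint16_t	flags;
	uint16_t	page_id;
} ModelOpaque;

/* Same shape as StirPageGetFreeSpace: block minus header, tuples, opaque. */
static size_t
model_free_space(size_t maxoff)
{
	return MODEL_BLCKSZ - MODEL_MAXALIGN(MODEL_PAGE_HEADER)
		- maxoff * sizeof(ModelTid)
		- MODEL_MAXALIGN(sizeof(ModelOpaque));
}

/* Number of TIDs that fit on one empty data page. */
static size_t
model_page_capacity(void)
{
	return model_free_space(0) / sizeof(ModelTid);
}

/* Byte position of the 1-based tuple slot, in the style of StirPageGetTuple. */
static size_t
model_tuple_offset(size_t offset)
{
	assert(offset >= 1);
	return MODEL_MAXALIGN(MODEL_PAGE_HEADER) + sizeof(ModelTid) * (offset - 1);
}
```

Under these assumed constants, model_page_capacity() works out to 1360 TIDs per page. Because the tuples are fixed-size and never deleted, no line-pointer array is needed, which is why stirbulkdelete can walk a page with plain pointer increments.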



  [application/x-patch] v33-0007-Refresh-snapshot-periodically-during-index-valid.patch (27.0K, 7-v33-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From e36be8d13923aee871a8bc970d88ae903e56956a Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v33 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every debug_cic_validate_snapshot_pages heap pages (4096 by default), allowing the xmin horizon to advance. This restores the feature from commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
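
The refresh cadence and the limitXmin bookkeeping described above can be modelled in isolation. This is a rough sketch only, not code from the patch: the model_* names are hypothetical, and it assumes that TransactionIdNewer returns the later of two normal XIDs under the usual modulo-2^32 comparison (special XIDs are ignored here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ModelXid;

/* Wraparound-aware "a is later than b" for normal XIDs (assumed semantics). */
static bool
model_xid_follows(ModelXid a, ModelXid b)
{
	return (int32_t) (ModelXid) (a - b) > 0;
}

/* Assumed behaviour of TransactionIdNewer: pick the later of two XIDs. */
static ModelXid
model_xid_newer(ModelXid a, ModelXid b)
{
	return model_xid_follows(a, b) ? a : b;
}

/*
 * Count how many snapshot refreshes a scan of npages heap pages triggers.
 * The counter starts at 1, as in the patch, so that no refresh happens
 * right at the start; interval == 0 disables refreshing.
 */
static int
model_refresh_count(int64_t npages, int interval)
{
	int64_t		page_read_counter = 1;
	int			refreshes = 0;

	assert(npages >= 0);
	for (int64_t i = 0; i < npages; i++)
	{
		page_read_counter++;
		if (interval > 0 && page_read_counter % interval == 0)
			refreshes++;
	}
	return refreshes;
}
```

With the default interval of 4096, the first refresh in this model happens after 4095 pages have been read, because the counter starts at 1. Folding each new snapshot's xmin into limitXmin with the "newer" helper means the caller ends up waiting for every transaction that could hold a snapshot older than the latest one used.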
 src/backend/access/heap/README.HOT         |  4 +-
 src/backend/access/heap/heapam_handler.c   | 77 +++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c      | 12 +++-
 src/backend/catalog/index.c                | 73 +++++++++++++++-----
 src/backend/commands/indexcmds.c           | 50 ++------------
 src/backend/utils/misc/guc_parameters.dat  |  9 +++
 src/include/access/tableam.h               | 25 ++++---
 src/include/access/transam.h               | 15 +++++
 src/include/catalog/index.h                |  2 +-
 src/include/miscadmin.h                    |  1 +
 src/test/regress/expected/create_index.out |  3 +
 src/test/regress/sql/create_index.sql      |  4 ++
 12 files changed, 192 insertions(+), 83 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 3a04453ff5d..836fd83c4a2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -53,6 +53,9 @@
 /* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
 int			debug_cic_validate_store_mem_pct = 10;
 
+/* GUC: refresh snapshot every N pages during CIC validation (0 = disable) */
+int			debug_cic_validate_snapshot_pages = 4096;
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
                                      Relation OldHeap, Relation NewHeap,
                                      Datum *values, bool *isnull, RewriteState rwstate);
@@ -2030,24 +2033,35 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int64			num_to_check;
+	int64			page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
+
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
 	ValidateIndexScanState callback_private_data;
 
 	Buffer buf;
@@ -2057,6 +2071,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use a percentage of maintenance_work_mem for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -2065,6 +2081,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!reset_snapshot || !HaveRegisteredOrActiveSnapshot());
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2080,6 +2102,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We replace it with a newer snapshot every so often to let
+	 * the xmin horizon advance.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2113,6 +2158,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2183,6 +2229,21 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+		if (reset_snapshot &&
+			debug_cic_validate_snapshot_pages > 0 &&
+			page_read_counter % debug_cic_validate_snapshot_pages == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* Advance limitXmin so we wait for all snapshots seen so far */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2192,11 +2253,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index c461f8dc02d..ef192fb99c2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index d79047fb284..9f96c902150 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3518,8 +3519,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  To
+ * let the xmin horizon advance, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3532,7 +3534,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3545,21 +3547,24 @@ IndexCheckExclusion(Relation heapRelation,
  * before it declares a uniqueness error.
  *
  * After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index).  The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit.  Subsequent transactions
+ * will be able to use it for queries.
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3572,6 +3577,16 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
 	int			aux_work_mem_part = maintenance_work_mem / 10;
 
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
+#endif
+
 	{
 		const int	progress_index[] = {
 			PROGRESS_CREATEIDX_PHASE,
@@ -3609,8 +3624,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3646,6 +3665,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3665,7 +3687,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3674,6 +3702,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3693,19 +3724,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3728,6 +3764,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 0709e4f986b..a2eb434e20c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1816,32 +1815,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1863,8 +1841,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4429,7 +4407,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4444,13 +4421,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4462,16 +4432,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(IsolationUsesXactSnapshot() || !TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4484,7 +4446,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a80ee4fb03f..be29cf3ba5a 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -631,6 +631,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_snapshot_pages', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Refresh snapshot every N pages during CIC validation (0 to disable).',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_snapshot_pages',
+  boot_val => '4096',
+  min => '0',
+  max => '1000000',
+},
+
 { name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 3705e21b588..49cea7ceef7 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -727,12 +727,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1887,20 +1886,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6fa91bfcdc0..b33084cb91a 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -417,6 +417,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs; an invalid ID loses to a valid one */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 378701b19f1..e928876c459 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -158,7 +158,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index eea3f818a86..f8c27e0dc63 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -269,6 +269,7 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
+extern PGDLLIMPORT int debug_cic_validate_snapshot_pages;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 2d6abb15a89..758c5884ff5 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3382,6 +3382,9 @@ DROP INDEX aux_index_ind6;
 --------+---------+-----------+----------+---------
  c1     | integer |           |          | 
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fd96d80abbc..65dd58b947d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1400,6 +1400,10 @@ DROP INDEX aux_index_ind6;
 -- Make sure auxiliary index dropped too
 \d aux_index_tab5
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.53.0
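
For readers following the limitXmin change in the patch above, here is a minimal, self-contained sketch of how the newest xmin can be tracked across periodic snapshot refreshes. It mirrors the shape of TransactionIdNewer from the transam.h hunk, but Xid, xid_follows, xid_newer and scan_limit_xmin are stand-in names invented for this illustration. The xmin_at_page array replaces real snapshots, and the special handling PostgreSQL gives to permanent XIDs is left out. The real scan lives in the table AM and is not reproduced here, so treat this as an assumption-laden model rather than the patch's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;				/* stand-in for TransactionId */
#define INVALID_XID ((Xid) 0)		/* stand-in for InvalidTransactionId */

/* Modulo-2^32 comparison: true if a is logically later than b. */
static bool
xid_follows(Xid a, Xid b)
{
	return (int32_t) (a - b) > 0;
}

/* Newer of two XIDs; an invalid XID loses to any valid one. */
static Xid
xid_newer(Xid a, Xid b)
{
	if (a == INVALID_XID)
		return b;
	if (b == INVALID_XID)
		return a;
	return xid_follows(a, b) ? a : b;
}

/*
 * Hypothetical scan loop: take a "snapshot" on the first page and then every
 * refresh_pages pages, remembering the newest xmin seen.  xmin_at_page[i] is
 * the xmin a fresh snapshot would report while scanning page i.
 */
static Xid
scan_limit_xmin(const Xid *xmin_at_page, int npages, int refresh_pages)
{
	Xid			limit = INVALID_XID;

	assert(refresh_pages >= 0);
	for (int page = 0; page < npages; page++)
	{
		/* refresh_pages == 0 disables refresh: only the first snapshot counts */
		bool		take = (page == 0) ||
			(refresh_pages > 0 && page % refresh_pages == 0);

		if (take)
			limit = xid_newer(limit, xmin_at_page[page]);
	}
	return limit;
}
```

The caller would then wait out transactions with snapshots older than the returned value, which is the role limitXmin plays in DefineIndex and ReindexRelationConcurrently.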



  [application/x-patch] v33-0006-Optimize-auxiliary-index-handling.patch (3.0K, 8-v33-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 39fffc2d2fa5d98edac9e290a24a8395842124c7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v33 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
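
To illustrate the two shortcuts described above, here is a small standalone model. It is only a sketch: DemoIndexInfo, DemoDatum and every demo_* function are hypothetical names made up for this example, and demo_eval_column stands in for whatever expression or attribute evaluation a real index would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ATTRS 4

typedef uintptr_t DemoDatum;		/* stand-in for Datum */

typedef struct DemoIndexInfo
{
	int			num_attrs;		/* plays the role of ii_NumIndexAttrs */
	bool		auxiliary;		/* plays the role of ii_Auxiliary */
} DemoIndexInfo;

/* Counts how often the "expensive" per-column evaluation ran. */
static int	demo_evals = 0;

static DemoDatum
demo_eval_column(int row, int attno)
{
	demo_evals++;
	return (DemoDatum) (row * 10 + attno);
}

/* Shortcut 1: an auxiliary index gets all-NULL values without any evaluation. */
static void
demo_form_index_datum(const DemoIndexInfo *info, int row,
					  DemoDatum *values, bool *isnull)
{
	assert(info->num_attrs > 0 && info->num_attrs <= DEMO_MAX_ATTRS);

	if (info->auxiliary)
	{
		memset(values, 0, sizeof(DemoDatum) * info->num_attrs);
		memset(isnull, true, sizeof(bool) * info->num_attrs);
		return;
	}

	for (int i = 0; i < info->num_attrs; i++)
	{
		values[i] = demo_eval_column(row, i);
		isnull[i] = false;
	}
}

/* Shortcut 2: never pass the "unchanged" hint for an auxiliary index. */
static bool
demo_index_unchanged(bool is_update, bool auxiliary, bool keys_equal)
{
	return is_update && !auxiliary && keys_equal;
}

/* Helper: number of column evaluations needed for one index insert. */
static int
demo_evals_for_insert(bool auxiliary, int num_attrs)
{
	DemoIndexInfo info = {num_attrs, auxiliary};
	DemoDatum	values[DEMO_MAX_ATTRS];
	bool		isnull[DEMO_MAX_ATTRS];

	demo_evals = 0;
	demo_form_index_datum(&info, 7, values, isnull);
	return demo_evals;
}
```

In the patch itself the second shortcut relies on && short-circuiting, so index_unchanged_by_update() is not even called for an auxiliary index. The model takes a precomputed keys_equal flag only to stay self-contained.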
---
 src/backend/catalog/index.c         | 9 +++++++++
 src/backend/executor/execIndexing.c | 5 ++++-
 src/include/nodes/execnodes.h       | 6 ++++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index aaf0b30ff9d..d79047fb284 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2923,6 +2923,15 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		Assert(indexInfo->ii_Am == STIR_AM_OID);
+		memset(values, 0, sizeof(Datum) * indexInfo->ii_NumIndexAttrs);
+		memset(isnull, true, sizeof(bool) * indexInfo->ii_NumIndexAttrs);
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 4363e154c0f..84e99d653ec 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && !indexInfo->ii_Auxiliary &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 136dddbbf11..69441685ddb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -166,8 +166,10 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.  ii_Auxiliary
+ * is also used during retail inserts to skip datum formation for auxiliary
+ * indexes.
  * ----------------
  */
 typedef struct IndexInfo
-- 
2.53.0




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-23 22:08                                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-28 19:17                                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-31 22:11                                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-04-06 18:21                                                                         ` Mihail Nikalayeu <[email protected]>
  2026-04-07 01:42                                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Josh Kupershmidt <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-04-06 18:21 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Rebased once again.


Attachments:

  [application/octet-stream] v34-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.6K, 2-v34-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From 84805f7f3c1b97941ef7ecaefcbc20c78aaca97b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v34 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set when index_bulk_delete is called during the validation phase of a concurrent index build
- the STIR (Short-Term Index Replacement) access method, intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper for concurrent index builds: it temporarily stores TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It is planned to be used in subsequent commits.
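
As a rough mental model of the lifecycle described above (create, accept inserts, one bulk-delete pass during validation, drop), here is a toy in-memory version. It is a sketch under assumptions: DemoStir, DemoTid and all demo_* functions are hypothetical and share nothing with the real stir.c beyond the idea of an append-only TID store. The callback never asks for deletion, which matches how I understand validate_index_callback to behave.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Append-only TID store: no keys, no ordering, no scans by value. */
typedef struct DemoStir
{
	DemoTid    *tids;
	size_t		ntids;
	size_t		capacity;
} DemoStir;

typedef bool (*DemoBulkDeleteCallback) (const DemoTid *tid, void *state);

/* Step 2: accept an insert; only the TID is remembered. */
static bool
demo_stir_insert(DemoStir *stir, uint32_t block, uint16_t offset)
{
	if (stir->ntids == stir->capacity)
	{
		size_t		newcap = stir->capacity ? stir->capacity * 2 : 16;
		DemoTid    *p = realloc(stir->tids, newcap * sizeof(DemoTid));

		if (p == NULL)
			return false;
		stir->tids = p;
		stir->capacity = newcap;
	}
	stir->tids[stir->ntids].block = block;
	stir->tids[stir->ntids].offset = offset;
	stir->ntids++;
	return true;
}

/* Step 3: show every stored TID to the callback; count deletion requests. */
static size_t
demo_stir_bulkdelete(const DemoStir *stir, DemoBulkDeleteCallback cb,
					 void *state)
{
	size_t		wanted = 0;

	for (size_t i = 0; i < stir->ntids; i++)
		if (cb(&stir->tids[i], state))
			wanted++;
	return wanted;
}

/* Step 4: drop the store. */
static void
demo_stir_drop(DemoStir *stir)
{
	free(stir->tids);
	stir->tids = NULL;
	stir->ntids = stir->capacity = 0;
}

/* Validation-style callback: count the TID, never request deletion. */
static bool
demo_collect_callback(const DemoTid *tid, void *state)
{
	size_t	   *collected = state;

	(void) tid;
	(*collected)++;
	return false;
}

/* Runs the whole lifecycle and returns how many TIDs validation saw. */
static size_t
demo_stir_lifecycle(uint32_t ninserts)
{
	DemoStir	stir = {NULL, 0, 0};
	size_t		collected = 0;
	size_t		wanted;

	for (uint32_t i = 0; i < ninserts; i++)
		if (!demo_stir_insert(&stir, i / 100, (uint16_t) (i % 100 + 1)))
			break;
	wanted = demo_stir_bulkdelete(&stir, demo_collect_callback, &collected);
	assert(wanted == 0);
	(void) wanted;
	demo_stir_drop(&stir);
	return collected;
}
```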
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 567 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 110 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 765 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..19cfdfd2640 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3012,6 +3012,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3063,6 +3064,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..932590d9ccb
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped,
+ * but we do it anyway for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+		/* Re-check after acquiring exclusive lock */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted the new page into the index, let's try again */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting into an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/*
+	 * For normal VACUUM, mark the index to skip inserts and warn that it needs
+	 * to be dropped.  In practice this path is not reachable during CREATE INDEX
+	 * CONCURRENTLY because the table-level locks held by CIC prevent concurrent
+	 * VACUUM from opening the auxiliary index.  It can only be reached if a
+	 * leftover STIR index somehow survives after a failed CIC and a later
+	 * VACUUM encounters it.
+	 */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't need to care about concurrently added
+	 * pages, because at this point the index is marked as not ready and is no
+	 * longer used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * As with stirbulkdelete, this is not reachable during a normal CIC due to
+ * table-level locking.  It serves as a safety net for leftover STIR indexes
+ * from failed concurrent index builds.
+ */
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9407c357f27..cc067e58d36 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3432,6 +3432,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 4aa52a4bd25..d7ea86b2805 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..cbeb49050cd 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -726,6 +726,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..1671c3c2196 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -896,6 +896,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b69320a7fc8..1a5de4b691a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index e8cb7f7a627..7f3f08a70ac 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..b08cf4d4ef0
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "nodes/pathnodes.h"
+#include "storage/bufpage.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ea17fc5629..d4644b0b5ef 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..ecaf82f2afa 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -169,8 +169,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -230,7 +230,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is auxiliary for concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index cfdc6b1a17a..cc947194aa7 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2131,9 +2131,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0
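
For readers following the STIR patch above, here is a minimal standalone sketch of the append-only page layout that StirPageAddItem and StirPageGetFreeSpace appear to implement: fixed-size TID entries are copied to the end of the used area, and the page refuses an insert once less than one entry of free space remains. It is only an illustration, not the patch's code. All Sketch* names and the three size constants are hypothetical stand-ins (the patch itself uses BLCKSZ, MAXALIGN(SizeOfPageHeaderData) and MAXALIGN(sizeof(StirPageOpaqueData))), and buffer locking, pd_lower maintenance and the meta-page are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes for illustration only; the patch derives them from server headers. */
#define SKETCH_PAGE_SIZE	8192	/* stand-in for BLCKSZ */
#define SKETCH_HEADER_SIZE	24		/* stand-in for the aligned page header */
#define SKETCH_OPAQUE_SIZE	8		/* stand-in for the aligned opaque area */

/* Hypothetical 6-byte heap pointer, mirroring the shape of ItemPointerData. */
typedef struct SketchTid
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} SketchTid;

/* Hypothetical page: a tuple counter plus the usable content area. */
typedef struct SketchPage
{
	uint16_t	maxoff;			/* number of entries stored so far */
	unsigned char data[SKETCH_PAGE_SIZE - SKETCH_HEADER_SIZE - SKETCH_OPAQUE_SIZE];
} SketchPage;

/* Reset a page to the empty state. */
static void
sketch_page_init(SketchPage *page)
{
	memset(page, 0, sizeof(*page));
}

/* Free space is the content area minus the bytes taken by stored entries. */
static size_t
sketch_page_free_space(const SketchPage *page)
{
	return sizeof(page->data) - (size_t) page->maxoff * sizeof(SketchTid);
}

/* Append one entry at the end; return false if it does not fit. */
static bool
sketch_page_add_item(SketchPage *page, const SketchTid *tid)
{
	if (sketch_page_free_space(page) < sizeof(SketchTid))
		return false;

	memcpy(page->data + (size_t) page->maxoff * sizeof(SketchTid),
		   tid, sizeof(SketchTid));
	page->maxoff++;
	return true;
}
```

A caller would initialize a page with sketch_page_init() and call sketch_page_add_item() until it returns false, at which point the real stirinsert() extends the relation and records the new block in the meta-page. Because entries are never removed or reordered, no line-pointer array is needed and a full page scan is a simple walk over maxoff fixed-size slots.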



  [application/octet-stream] v34-0001-Add-stress-tests-for-concurrent-index-builds.patch (12.5K, 3-v34-0001-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From c4f389d4c181939bfab95f50b47ebe866252ffa7 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v34 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 293 ++++++++++++++++++++++++++++++++
 2 files changed, 294 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..dd7a1eff0ef
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,293 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my $node;
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+  $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+  $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+	'--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+	$pgbench_clients,
+	$pgbench_jobs,
+	$pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+		sprintf(
+		'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+		defined($ENV{PG_TEST_EXTRA})
+		? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+		: '(undef)',
+		$is_stress,
+		$pgbench_clients,
+		$pgbench_jobs,
+		$pgbench_transactions,
+		$max_sleep_ms,
+		$no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => sprintf(q(
+			SET debug_parallel_query = off; -- because of the predicate_stable implementation
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->stop;
+done_testing();
-- 
2.43.0



  [application/octet-stream] v34-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 4-v34-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
From 1337fb6930d365238c82113a611620758082cd8d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v34 3/7] Add Datum storage support to tuplestore Extend
 tuplestore to store individual Datum values

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 367 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 327 insertions(+), 73 deletions(-)
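
For readers who want to see the on-tape layout without reading the whole diff, here is a minimal standalone sketch of what writetup_datum/readtup_datum do. It is an illustration only, not code from the patch: Tape, tape_write, tape_read, put_byval, get_byval, put_byref and get_byref are hypothetical names, the "tape" is a plain memory buffer rather than a BufFile, and the by-value width is assumed to be a fixed 8 bytes instead of going through store_att_byval.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; the patch itself writes to a BufFile. */
typedef struct Tape
{
	unsigned char buf[256];
	size_t		wpos;			/* next byte to write */
	size_t		rpos;			/* next byte to read */
} Tape;

static void
tape_write(Tape *t, const void *src, size_t n)
{
	assert(t->wpos + n <= sizeof(t->buf));
	memcpy(t->buf + t->wpos, src, n);
	t->wpos += n;
}

/* Returns false at end of data, true after copying n bytes. */
static bool
tape_read(Tape *t, void *dst, size_t n)
{
	if (t->rpos + n > t->wpos)
		return false;
	memcpy(dst, t->buf + t->rpos, n);
	t->rpos += n;
	return true;
}

/* By-value entry: one marker byte, then the value (assumed 8 bytes here). */
static void
put_byval(Tape *t, uint64_t value)
{
	uint8_t		marker = (uint8_t) sizeof(value);

	tape_write(t, &marker, sizeof(marker));
	tape_write(t, &value, sizeof(value));
}

/* EOF is detected via the marker, so a stored zero is still a valid value. */
static bool
get_byval(Tape *t, uint64_t *value)
{
	uint8_t		marker;

	if (!tape_read(t, &marker, sizeof(marker)))
		return false;			/* end of data */
	assert(marker == sizeof(*value));
	return tape_read(t, value, sizeof(*value));
}

/* By-reference entry: length word (counting itself), data, optional trailer. */
static void
put_byref(Tape *t, const void *data, unsigned int size, bool backward)
{
	unsigned int tuplen = size + (unsigned int) sizeof(unsigned int);

	tape_write(t, &tuplen, sizeof(tuplen));
	tape_write(t, data, size);
	if (backward)
		tape_write(t, &tuplen, sizeof(tuplen));
}

/* Reads one by-reference entry; returns false at end of data. */
static bool
get_byref(Tape *t, void *out, size_t outcap, unsigned int *size, bool backward)
{
	unsigned int tuplen;
	unsigned int trailer;

	if (!tape_read(t, &tuplen, sizeof(tuplen)))
		return false;			/* end of data */
	*size = tuplen - (unsigned int) sizeof(unsigned int);
	if (*size > outcap || !tape_read(t, out, *size))
		return false;
	if (backward && !tape_read(t, &trailer, sizeof(trailer)))
		return false;
	return true;
}
```

The point of the marker byte in the by-value case is that the reader can tell "end of data" apart from "a stored value that happens to be zero", which is the same distinction tuplestore_getdatum makes via eof_reached.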

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index f9e2d95186a..2a9b25bd238 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 #include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of stored Datums */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to read the length of the next tuple from tape. Supplies the
+	 * 'len' argument for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -207,10 +218,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For by-value Datums both length words are omitted (a one-byte marker is used).
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (if it is not omitted, as in the case of a by-value Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For a by-value datum, zero may be a valid value. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_gettupleslot_force - exported function to fetch a tuple
  *
@@ -1205,10 +1319,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
 			pfree(tuple);
 		return true;
 	}
-	else
+
+	/*
+	 * A NULL return normally means end-of-data, but for by-value datum
+	 * stores a valid zero-valued datum (e.g., false, 0) is indistinguishable
+	 * from NULL via pointer check.  Use eof_reached to distinguish.
+	 */
+	if (state->datumTypeByVal)
 	{
-		return false;
+		TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+		return !readptr->eof_reached;
 	}
+
+	return false;
 }
 
 /*
@@ -1271,7 +1395,13 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
 				tuple = tuplestore_gettuple(state, forward, &should_free);
 
 				if (tuple == NULL)
-					return false;
+				{
+					/* See tuplestore_advance for why pointer check is insufficient */
+					if (!state->datumTypeByVal ||
+						state->readptrs[state->activeptr].eof_reached)
+						return false;
+					continue;
+				}
 				if (should_free)
 					pfree(tuple);
 				CHECK_FOR_INTERRUPTS();
@@ -1505,8 +1635,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 		/* As in dumptuples(), increment memtupdeleted synchronously */
 		state->memtupdeleted++;
@@ -1603,25 +1736,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1632,6 +1746,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1678,3 +1805,127 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both by-reference and by-value Datums:
+ * - By-reference Datums (fixed or variable length) include a length prefix (and suffix if backward scan)
+ * - By-value Datums are stored inline without extra copying, preceded by a single marker byte
+ *   XXX: consider refactoring to avoid the marker byte; it is currently needed for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	if (datum == NULL)
+		return NULL;
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8 junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* marker byte, used only to track the end of data for rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size;
+		unsigned int tuplen;
+
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		else
+			size = state->datumTypeLen;
+
+		/*
+		 * Include sizeof(unsigned int) in the stored length, matching the
+		 * convention used by writetup_heap.  The backward-scan seek
+		 * arithmetic in tuplestore_gettuple assumes this.
+		 */
+		tuplen = size + sizeof(unsigned int);
+		BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = 0;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		unsigned int datalen = len - sizeof(unsigned int);
+		void *data = palloc(datalen);
+
+		BufFileReadExact(state->myfile, data, datalen);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index f638b96e156..e16d9a3d352 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
 										  bool copy, TupleTableSlot *slot);
-- 
2.43.0



  [application/octet-stream] v34-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (98.1K, 5-v34-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 412f7e850d93f4ce27e7955bfca1076c4d0862bd Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v34 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index into the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs of each are put into a tuplesort
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched through a read stream; heapam_index_validate_scan_read_stream_next reads the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
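
For readers skimming the message, the two-pointer step above can be sketched in isolation. This is only an illustration under simplifying assumptions, not the code in the patch: TIDs are modelled as plain sorted int64 arrays rather than tuplesorts, and the helper name sorted_difference and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: copy into out[] every value
 * that is present in the sorted array aux[] but absent from the sorted
 * array main_tids[].  out[] must have room for naux entries.
 * Returns the number of values written.
 */
static size_t
sorted_difference(const int64_t *main_tids, size_t nmain,
                  const int64_t *aux, size_t naux,
                  int64_t *out)
{
    size_t i = 0;       /* cursor into main_tids[] */
    size_t nout = 0;    /* number of values emitted so far */

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux[j]. */
        while (i < nmain && main_tids[i] < aux[j])
            i++;

        /*
         * aux[j] is missing from main if main is exhausted or its cursor
         * has already moved past aux[j].
         */
        if (i == nmain || main_tids[i] > aux[j])
            out[nout++] = aux[j];
    }

    return nout;
}
```

Because both inputs are sorted, each cursor only moves forward, so the pass is linear in the combined number of TIDs.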
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  40 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 561 ++++++++++++++-------
 src/backend/catalog/index.c                | 322 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 345 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/backend/utils/misc/guc_parameters.dat  |   9 +
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/miscadmin.h                    |   1 +
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  42 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1156 insertions(+), 336 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 312374da5e0..a4186e8a22f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6792,6 +6792,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that future transactions use the
+       auxiliary index for new tuples.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -6832,13 +6844,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -6855,8 +6866,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..12c88587a79 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind an <quote>invalid</quote> index and its
+    associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..9e0248261ae 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and its auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts them if they are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 07f07188d46..a3474925d61 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -42,15 +42,20 @@
 #include "storage/lmgr.h"
 #include "storage/lock.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
+
+/* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
+int			debug_cic_validate_store_mem_pct = 10;
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
-									 Relation OldHeap, Relation NewHeap,
-									 Datum *values, bool *isnull, RewriteState rwstate);
+                                     Relation OldHeap, Relation NewHeap,
+                                     Datum *values, bool *isnull, RewriteState rwstate);
 
 static bool SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 								   HeapTuple tuple,
@@ -1665,242 +1670,422 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate the set difference (relative complement) of the aux and main
+ * sets.
+ *
+ * All records which are present in the auxiliary tuplesort but not in
+ * the main one are added to the store.
+ *
+ * In set theory notation: store = aux \ main (also written aux - main).
+ *
+ * Returns the number of items added to the store.
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+										   Tuplesortstate *aux,
+										   Tuplestorestate *store)
+{
+	int64		num = 0;
+	/* state variables for the merge */
+	ItemPointer	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds
+	 * TIDs that must be compared to those from the "main" sort
+	 * (main).
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is ReadStreamBlockNumberCB implementation which works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If no block has been seen yet (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't seen any block yet (i.e. this is the very first
+		 * tuple read from the store), initialize it using that tuple.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int64			num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber *tuples;
+	ReadStream *read_stream;
+
+	/* Use a percentage of maintenance_work_mem for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to close the tuplesorts as soon as we can */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
+		 * It is safe to access tuple data after releasing the buffer lock
+		 * because the buffer pin is still held, and the only operation that
+		 * could physically move tuple data on the page is
+		 * PageRepairFragmentation via heap_page_prune.  VACUUM conflicts with
+		 * CIC (both take ShareUpdateExclusiveLock), and opportunistic pruning
+		 * from concurrent DML cannot affect root tuples we are referencing.
 		 */
-		if (hscan->rs_cblock != root_blkno)
-		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
+		 * No predicate evaluation is needed here: the auxiliary STIR index
+		 * only contains TIDs for tuples that already satisfied the partial
+		 * index predicate at DML time (checked in ExecInsertIndexTuples).
 		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
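
The validation path above encodes TIDs as int8 values and asks heapam_index_validate_tuplesort_difference() for the TIDs that are in the auxiliary index but not in the target index. The body of that function is not part of this excerpt, so the following is only a minimal standalone sketch of the idea, not the patch's code. All names (SketchTid, sketch_tid_encode, sketch_tid_decode, sketch_sorted_difference) are hypothetical, plain sorted arrays stand in for the two tuplesorts and the tuplestore, and the bit layout (block number in the high bits, 16-bit offset in the low bits) is an assumption chosen so that integer order equals (block, offset) order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: block number plus line offset. */
typedef struct
{
	uint32_t	block;
	uint16_t	offset;
} SketchTid;

/*
 * Pack a TID into one 64-bit integer.  Comparing the integers gives the
 * same order as comparing (block, offset) pairs.
 */
static int64_t
sketch_tid_encode(SketchTid tid)
{
	assert(tid.offset != 0);	/* offset 0 is treated as invalid here */
	return (int64_t) (((uint64_t) tid.block << 16) | tid.offset);
}

/* Inverse of sketch_tid_encode(). */
static SketchTid
sketch_tid_decode(int64_t encoded)
{
	SketchTid	tid;

	tid.block = (uint32_t) ((uint64_t) encoded >> 16);
	tid.offset = (uint16_t) (encoded & 0xFFFF);
	return tid;
}

/*
 * Copy into out[] every value of aux[] that does not occur in main_tids[].
 * Both inputs must be sorted ascending; out[] needs room for naux values.
 * Returns the number of values written.
 */
static size_t
sketch_sorted_difference(const int64_t *aux, size_t naux,
						 const int64_t *main_tids, size_t nmain,
						 int64_t *out)
{
	size_t		j = 0;
	size_t		n = 0;

	for (size_t i = 0; i < naux; i++)
	{
		/* Advance the main cursor until it reaches or passes aux[i]. */
		while (j < nmain && main_tids[j] < aux[i])
			j++;

		/* No match in main: this TID still has to be inserted. */
		if (j == nmain || main_tids[j] != aux[i])
			out[n++] = aux[i];
	}
	return n;
}
```

Because both inputs are sorted, the difference takes a single pass over each list, which is why the sort step has to finish before the merge starts.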
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cc067e58d36..b1417ec05c6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,6 +715,8 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  *		INDEX_CREATE_SUPPRESS_PROGRESS:
  *			don't report progress during the index build.
  *
@@ -723,6 +725,9 @@ UpdateIndexRelation(Oid indexoid,
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most
+ *		cases it should equal the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -763,6 +768,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	bool		progress = (flags & INDEX_CREATE_SUPPRESS_PROGRESS) == 0;
 	char		relkind;
 	TransactionId relfrozenxid;
@@ -789,7 +795,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -797,6 +806,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1402,7 +1416,8 @@ index_create_copy(Relation heapRelation, uint16 flags,
 							!concurrently,	/* isready */
 							concurrently,	/* concurrent */
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/* fetch exclusion constraint info if any */
 	if (indexRelation->rd_index->indisexclusion)
@@ -1422,13 +1437,16 @@ index_create_copy(Relation heapRelation, uint16 flags,
 	 * index information.  All this information will be used for the index
 	 * creation.
 	 */
-	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
 	{
 		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
-		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
 
-		indexColNames = lappend(indexColNames, NameStr(att->attname));
-		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
 	}
 
 	/* Extract opclass options for each attribute */
@@ -1490,6 +1508,157 @@ index_create_copy(Relation heapRelation, uint16 flags,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,	/* concurrent */
+							false,	/* aux are not summarizing */
+							false,	/* aux are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2470,7 +2639,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2530,7 +2700,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3309,12 +3480,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way; it uses the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation with no
+ * actual table scan; the result is an empty STIR index.  We commit the
+ * transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * also inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3324,14 +3504,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3339,12 +3522,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the two TID lists to see which tuples from the auxiliary index are missing
+ * from the target index.  Thus we will ensure that all tuples valid according
+ * to the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3362,22 +3547,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, the actions needed to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * 10% for sorting the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3410,6 +3599,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3434,15 +3624,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to the auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If the aux index is empty, the merge can be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3465,27 +3689,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts were ended by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3494,6 +3721,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3554,6 +3782,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3825,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4067,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4092,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index eba25aa3e4d..5dcd318012e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1388,16 +1388,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9ab74c8df0a..2d7b6b7eb8b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1603,6 +1619,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1631,11 +1657,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates. New indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to either insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1645,7 +1671,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1684,7 +1710,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1696,14 +1722,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but marked as "ready-for-inserts". The goal of that index is
+	 * to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1738,9 +1794,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * The target index is now marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1758,24 +1833,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge the contents of the auxiliary and target indexes: insert any missing index entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1802,7 +1867,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1827,6 +1892,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3598,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3703,8 +3816,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3756,8 +3876,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo*	indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3818,6 +3945,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3921,15 +4055,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3980,6 +4117,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3997,11 +4139,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 									   tablespaceid,
 									   concurrentName);
 
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4010,6 +4158,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4028,10 +4177,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4112,13 +4265,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast because no heap scan is done, it is just an empty index. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to perform target index build. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4165,6 +4365,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this point all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4172,12 +4407,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4215,7 +4444,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4244,7 +4473,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4335,14 +4564,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4367,6 +4596,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4380,11 +4631,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4404,6 +4655,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7a8a5d0764c..4f8761de6b9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -632,6 +632,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_store_mem_pct',
+  boot_val => '10',
+  min => '1',
+  max => '90',
+},
+
 { name => 'debug_copy_parse_plan_trees', type => 'bool', context => 'PGC_SUSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Set this to force all parse and plan trees to be passed through copyObject(), to facilitate catching errors and omissions in copyObject().',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4647785fd35..fafca930aae 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -738,7 +738,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1892,19 +1893,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of that function to close the sortstates
+ * in both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 9aee8226347..3239e5c716f 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -72,6 +73,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
 #define INDEX_CREATE_SUPPRESS_PROGRESS		(1 << 7)
+#define INDEX_CREATE_AUXILIARY				(1 << 8)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid	index_create_copy(Relation heapRelation, uint16 flags,
 							  Oid oldIndexId, Oid tablespaceOid,
 							  const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -152,7 +159,7 @@ extern void index_build(Relation heapRelation,
 						bool parallel,
 						bool progress);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 67948667a97..35990693f39 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -117,14 +117,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7277c37e779..7ea643b7f80 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -268,6 +268,7 @@ extern PGDLLIMPORT bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..d1723f47e89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,44 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+ERROR:  relation "concur_reindex_tab4" does not exist
+LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+                    ^
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index f50868ca6a6..b34009f868c 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 81a73c426d2..ea52f0725c3 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2064,14 +2064,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..c2c1b031527 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0
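
The phase renumbering in the rules.out hunk above can be restated as a lookup table. This is a minimal C sketch for readers, not code from the patch: `progress_phase_label` and `phase_labels` are hypothetical names, and phase 3 is shown without the AM-specific suffix that the view appends via pg_indexam_progress_phasename().

```c
#include <stddef.h>

/*
 * Illustrative only: phase numbers and labels are copied from the updated
 * pg_stat_progress_create_index view definition in rules.out.
 * Neither the table nor the helper below exists in PostgreSQL.
 */
static const char *const phase_labels[] = {
    "initializing",                               /* 0 */
    "waiting for writers before build",           /* 1 */
    "waiting for writers to use auxiliary index", /* 2 (new) */
    "building index",                             /* 3 (was 2) */
    "waiting for writers before validation",      /* 4 (was 3) */
    "index validation: scanning index",           /* 5 (was 4) */
    "index validation: sorting tuples",           /* 6 (was 5) */
    "index validation: merging indexes",          /* 7 (replaces "scanning table") */
    "waiting for old snapshots",                  /* 8 (was 7) */
    "waiting for readers before marking dead",    /* 9 (was 8) */
    "waiting for readers before dropping",        /* 10 (was 9) */
};

/* Return the label for a phase number, or NULL if it is out of range. */
const char *
progress_phase_label(int phase)
{
    size_t n = sizeof(phase_labels) / sizeof(phase_labels[0]);

    if (phase < 0 || (size_t) phase >= n)
        return NULL;
    return phase_labels[phase];
}
```

Monitoring scripts that match on the numeric phase rather than the text would need to account for the shift by one from phase 2 onward.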



  [application/octet-stream] v34-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (31.7K, 6-v34-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From aecd02fb2d73787706687add200069f57216d1d8 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v34 5/7] Track and drop auxiliary indexes in DROP/REINDEX

During concurrent index operations, auxiliary indexes may be left as orphaned objects when errors occur (junk auxiliary indexes).

This patch improves the handling of such auxiliary indexes:
- add auxiliaryForIndexId parameter to index_create() to track dependencies between main and auxiliary indexes
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
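
The dependency tracking described in the list above can be modeled as a scan over dependency records. The following is a simplified, self-contained C sketch under stated assumptions: `ToyDepend`, `toy_get_auxiliary_index`, and the `obj_is_stir` flag are hypothetical stand-ins for a pg_depend row and the catalog lookups, not the patch's actual code.

```c
#include <stddef.h>

typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/* Hypothetical, simplified stand-in for one pg_depend row. */
typedef struct ToyDepend
{
    Oid  objid;        /* dependent object (candidate auxiliary index) */
    Oid  refobjid;     /* referenced object (main index) */
    char deptype;      /* 'a' marks an auto dependency in this sketch */
    int  obj_is_stir;  /* nonzero if the dependent index uses the STIR AM */
} ToyDepend;

/*
 * Return the OID of the first STIR index that has an auto dependency on
 * indexId, or InvalidOid if there is none.
 */
Oid
toy_get_auxiliary_index(const ToyDepend *deps, size_t ndeps, Oid indexId)
{
    for (size_t i = 0; i < ndeps; i++)
    {
        if (deps[i].refobjid == indexId &&
            deps[i].deptype == 'a' &&
            deps[i].obj_is_stir)
            return deps[i].objid;
    }
    return InvalidOid;
}
```

Because the auxiliary index depends on the main index with an auto dependency, dropping the main index is enough to remove both.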
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |   8 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  78 +++++++++++----
 src/backend/catalog/pg_depend.c            |  62 ++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 +++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 105 +++++++++++++++++++--
 src/test/regress/sql/create_index.sql      |  57 ++++++++++-
 14 files changed, 380 insertions(+), 44 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 12c88587a79..406c02e866e 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 9e0248261ae..ac9cfec5c55 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -476,11 +476,15 @@ Indexes:
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
     recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b1417ec05c6..9136dfc7c73 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -780,6 +780,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1185,6 +1187,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1417,7 +1428,8 @@ index_create_copy(Relation heapRelation, uint16 flags,
 							concurrently,	/* concurrent */
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/* fetch exclusion constraint info if any */
 	if (indexRelation->rd_index->indisexclusion)
@@ -1601,7 +1613,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2640,7 +2653,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2701,7 +2715,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3783,8 +3798,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			indexForm->indisvalid = true;
 			break;
 		case INDEX_DROP_CLEAR_READY:
-			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
-			Assert(indexForm->indisready);
+			/*
+			 * Clear indisready during a CREATE INDEX CONCURRENTLY sequence.
+			 * indisready may already be false if the CIC failed before
+			 * index_concurrently_build had a chance to set it.
+			 */
 			Assert(!indexForm->indisvalid);
 			indexForm->indisready = false;
 			break;
@@ -3869,6 +3887,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3944,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* Check for a leftover auxiliary index of this index; it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4245,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4335,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4384,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..deacd2f7c95 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,67 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * Look for an AUTO dependency on a STIR index.  There can be at most
+		 * one STIR auxiliary per index, so we stop at the first match.
+		 * Transitive auxiliaries (e.g. ccnew_ccaux from a failed REINDEX
+		 * CONCURRENTLY) are found by calling this with the ccnew OID, and
+		 * are also cleaned up automatically via cascading AUTO dependency
+		 * when the intermediate index is dropped.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index d7ea86b2805..f428dcdf10f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -315,6 +315,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2d7b6b7eb8b..46c4ccc6789 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3711,6 +3712,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4060,6 +4062,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4067,6 +4070,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4144,12 +4148,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Look up any leftover auxiliary index of the reindexed index, so it can be dropped */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4159,6 +4168,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4180,10 +4190,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4372,7 +4390,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes, including
+	 * junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4395,6 +4414,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4614,6 +4636,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4665,6 +4689,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ce2e81f9c2..e2309b6a1ba 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop has to happen in the first transaction. If the
+		 * index has a junk auxiliary index, we want to drop that too (also
+		 * concurrently). In that case, silently perform an internal deletion of
+		 * the auxiliary index and restore the transaction state. It is fine to do
+		 * this inside the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ecaf82f2afa..f1605e00cdc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -232,6 +232,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index d1723f47e89..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3279,20 +3279,109 @@ ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-ERROR:  relation "concur_reindex_tab4" does not exist
-LINE 1: DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-                    ^
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-ERROR:  could not create unique index "aux_index_ind6"
-DETAIL:  Key (c1)=(1) is duplicated.
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
 WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
 HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
 NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c2c1b031527..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1344,11 +1344,62 @@ REINDEX INDEX aux_index_ind6_ccaux;
 -- Concurrently also
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
-DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v34-0006-Optimize-auxiliary-index-handling.patch (3.0K, 7-v34-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From 17191fcaeb5d38fbb2f4181e04814c62df6d771d Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v34 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indexes by:
- detecting auxiliary indexes in the index-insert path and bypassing Datum value computation
- setting indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 9 +++++++++
 src/backend/executor/execIndexing.c | 5 ++++-
 src/include/nodes/execnodes.h       | 6 ++++--
 3 files changed, 17 insertions(+), 3 deletions(-)
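
A minimal standalone sketch of the short-circuit described in the commit message above, for readers who want to see its effect outside the backend. It assumes stand-in types: SketchDatum, SketchIndexInfo and sketch_form_index_datum() are hypothetical names that only mirror Datum, ii_Auxiliary/ii_NumIndexAttrs and FormIndexDatum(); the regular path is a placeholder, not the real expression evaluation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the backend types; hypothetical, for illustration only. */
typedef uintptr_t SketchDatum;

#define SKETCH_MAX_KEYS 32

typedef struct SketchIndexInfo
{
	int			num_index_attrs;	/* mirrors ii_NumIndexAttrs */
	bool		auxiliary;			/* mirrors ii_Auxiliary */
} SketchIndexInfo;

/*
 * Fill values[]/isnull[] for one heap tuple.  Returns true when the
 * auxiliary short-circuit was taken and no per-column work was done.
 */
static bool
sketch_form_index_datum(const SketchIndexInfo *info,
						SketchDatum *values, bool *isnull)
{
	assert(info->num_index_attrs <= SKETCH_MAX_KEYS);

	if (info->auxiliary)
	{
		/* An auxiliary index stores only TIDs: every column is NULL. */
		memset(values, 0, sizeof(SketchDatum) * info->num_index_attrs);
		memset(isnull, true, sizeof(bool) * info->num_index_attrs);
		return true;
	}

	/* Placeholder for the regular path (column fetch or expression eval). */
	for (int i = 0; i < info->num_index_attrs; i++)
	{
		values[i] = (SketchDatum) (i + 1);
		isnull[i] = false;
	}
	return false;
}
```

The point of the design is that the early return sits before any expression-state setup, so an auxiliary index on an expensive expression costs the same as one on a plain column.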

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9136dfc7c73..4edf68aced2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2940,6 +2940,15 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		Assert(indexInfo->ii_Am == STIR_AM_OID);
+		memset(values, 0, sizeof(Datum) * indexInfo->ii_NumIndexAttrs);
+		memset(isnull, true, sizeof(bool) * indexInfo->ii_NumIndexAttrs);
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 4363e154c0f..84e99d653ec 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -438,8 +438,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && !indexInfo->ii_Auxiliary &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f1605e00cdc..62f797bc197 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -169,8 +169,10 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.  ii_Auxiliary
+ * is also used during retail inserts to skip datum formation for auxiliary
+ * indexes.
  * ----------------
  */
 typedef struct IndexInfo
-- 
2.43.0



  [application/octet-stream] v34-0007-Refresh-snapshot-periodically-during-index-valid.patch (27.1K, 8-v34-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 62a289139fb151848a5db5d89d87ed024ce6b5f2 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v34 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. A single long-lived snapshot holds back the xmin horizon for the whole of a long-running validation.

The validation now takes a fresh snapshot every debug_cic_validate_snapshot_pages heap pages (4096 by default), allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT         |  4 +-
 src/backend/access/heap/heapam_handler.c   | 77 +++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c      | 12 +++-
 src/backend/catalog/index.c                | 73 +++++++++++++++-----
 src/backend/commands/indexcmds.c           | 52 +++------------
 src/backend/utils/misc/guc_parameters.dat  |  9 +++
 src/include/access/tableam.h               | 25 ++++---
 src/include/access/transam.h               | 15 +++++
 src/include/catalog/index.h                |  2 +-
 src/include/miscadmin.h                    |  1 +
 src/test/regress/expected/create_index.out |  3 +
 src/test/regress/sql/create_index.sql      |  4 ++
 12 files changed, 194 insertions(+), 83 deletions(-)
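
A rough standalone sketch of the refresh cadence and limitXmin tracking described above, under stated assumptions: SketchXid, sketch_xid_newer() and sketch_validate_scan() are hypothetical stand-ins, the snapshot xmins are supplied as a plain array instead of coming from GetLatestSnapshot(), and the comparison assumes normal (non-special) XIDs compared modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SketchXid;		/* stand-in for TransactionId */

/* Wraparound-aware "newer of two XIDs"; both must be normal XIDs. */
static SketchXid
sketch_xid_newer(SketchXid a, SketchXid b)
{
	return ((int32_t) (a - b) > 0) ? a : b;
}

/*
 * Replay the page loop: snapshot_xmins[0] is the initial snapshot and each
 * refresh consumes the next entry.  Returns the xmin the caller would have
 * to wait out; *refreshes reports how many refreshes happened.
 */
static SketchXid
sketch_validate_scan(int64_t npages, int refresh_every,
					 const SketchXid *snapshot_xmins, int nsnapshots,
					 int *refreshes)
{
	int64_t		page_read_counter = 1;	/* 1 skips a refresh at the start */
	int			next = 1;
	SketchXid	limit_xmin;

	assert(nsnapshots >= 1);
	limit_xmin = snapshot_xmins[0];
	*refreshes = 0;

	for (int64_t page = 0; page < npages; page++)
	{
		page_read_counter++;

		if (refresh_every > 0 &&
			page_read_counter % refresh_every == 0 &&
			next < nsnapshots)
		{
			/* Keep the newest xmin seen across all snapshots so far. */
			limit_xmin = sketch_xid_newer(limit_xmin, snapshot_xmins[next++]);
			(*refreshes)++;
		}
	}
	return limit_xmin;
}
```

Setting refresh_every to 0 models debug_cic_validate_snapshot_pages = 0: no refresh happens and the result is simply the xmin of the first snapshot.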

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a3474925d61..8a9d94b1edd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -53,6 +53,9 @@
 /* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
 int			debug_cic_validate_store_mem_pct = 10;
 
+/* GUC: refresh snapshot every N pages during CIC validation (0 = disable) */
+int			debug_cic_validate_snapshot_pages = 4096;
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
                                      Relation OldHeap, Relation NewHeap,
                                      Datum *values, bool *isnull, RewriteState rwstate);
@@ -1922,24 +1925,35 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int64			num_to_check;
+	int64			page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
+
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
 	ValidateIndexScanState callback_private_data;
 
 	Buffer buf;
@@ -1949,6 +1963,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use a percentage of maintenance_work_mem for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -1957,6 +1973,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!reset_snapshot || !HaveRegisteredOrActiveSnapshot());
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -1972,6 +1994,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2005,6 +2050,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2075,6 +2121,21 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+		if (reset_snapshot &&
+			debug_cic_validate_snapshot_pages > 0 &&
+			page_read_counter % debug_cic_validate_snapshot_pages == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* Advance limitXmin so we wait for all snapshots seen so far */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2084,11 +2145,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index c461f8dc02d..ef192fb99c2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM.  If myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4edf68aced2..49adcb152cf 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3538,8 +3539,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3552,7 +3554,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3565,21 +3567,24 @@ IndexCheckExclusion(Relation heapRelation,
  * before it declares a uniqueness error.
  *
  * After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index).  The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit.  Subsequent transactions
+ * will be able to use it for queries.
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3592,6 +3597,16 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
 	int			aux_work_mem_part = maintenance_work_mem / 10;
 
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
+#endif
+
 	{
 		const int	progress_index[] = {
 			PROGRESS_CREATEIDX_PHASE,
@@ -3629,8 +3644,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3666,6 +3685,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3685,7 +3707,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3694,6 +3722,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3713,19 +3744,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may have taken a catalog snapshot; release it */
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3748,6 +3784,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 46c4ccc6789..a700068f8a2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1816,32 +1815,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1863,8 +1841,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4433,7 +4411,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4448,13 +4425,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4466,16 +4436,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4485,10 +4446,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		CommitTransactionCommand();
 		StartTransactionCommand();
 
+		/* We should now definitely not be advertising any xmin. */
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 4f8761de6b9..c566b6040c9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -632,6 +632,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_snapshot_pages', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Refresh snapshot every N pages during CIC validation (0 to disable).',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_snapshot_pages',
+  boot_val => '4096',
+  min => '0',
+  max => '1000000',
+},
+
 { name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index fafca930aae..033237a9ce4 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -734,12 +734,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1897,20 +1896,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 55a4ab26b34..923aadbab43 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -415,6 +415,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 3239e5c716f..def7352a859 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -159,7 +159,7 @@ extern void index_build(Relation heapRelation,
 						bool parallel,
 						bool progress);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7ea643b7f80..8b3e7b21da1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -269,6 +269,7 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
+extern PGDLLIMPORT int debug_cic_validate_snapshot_pages;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 2d6abb15a89..758c5884ff5 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3382,6 +3382,9 @@ DROP INDEX aux_index_ind6;
 --------+---------+-----------+----------+---------
  c1     | integer |           |          | 
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fd96d80abbc..65dd58b947d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1400,6 +1400,10 @@ DROP INDEX aux_index_ind6;
 -- Make sure auxiliary index dropped too
 \d aux_index_tab5
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
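
The patch above makes validate_index() return a TransactionId and adds TransactionIdNewer() to transam.h, presumably so that the newest xmin among the refreshed snapshots becomes the limit to wait for. Below is a minimal standalone sketch of that idea, not the patch's actual heapam code. The TransactionId typedef and macros are simplified stand-ins for the real headers, and model_validate_scan() with its snapshot_xmins[] input is a hypothetical helper invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's definitions (assumption: not the real headers). */
typedef uint32_t TransactionId;

#define InvalidTransactionId       ((TransactionId) 0)
#define FirstNormalTransactionId   ((TransactionId) 3)
#define TransactionIdIsValid(xid)  ((xid) != InvalidTransactionId)
#define TransactionIdIsNormal(xid) ((xid) >= FirstNormalTransactionId)

/* Wraparound-aware "a is logically later than b" (modulo-2^32 comparison). */
static bool
TransactionIdFollows(TransactionId a, TransactionId b)
{
	int32_t		diff;

	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (!TransactionIdIsNormal(a) || !TransactionIdIsNormal(b))
		return a > b;

	diff = (int32_t) (a - b);
	return diff > 0;
}

/* Same logic as the helper the patch adds to transam.h. */
static TransactionId
TransactionIdNewer(TransactionId a, TransactionId b)
{
	if (!TransactionIdIsValid(a))
		return b;
	if (!TransactionIdIsValid(b))
		return a;
	return TransactionIdFollows(a, b) ? a : b;
}

/*
 * Hypothetical model of the validation scan: a first snapshot is taken up
 * front, and every refresh_pages pages another one is "taken" (its xmin is
 * read from snapshot_xmins[]).  The newest xmin seen is returned as the
 * limit.  refresh_pages == 0 disables refreshing, as the GUC's description
 * says for debug_cic_validate_snapshot_pages.
 */
static TransactionId
model_validate_scan(uint32_t npages, uint32_t refresh_pages,
					const TransactionId *snapshot_xmins, uint32_t nsnapshots)
{
	TransactionId limit_xmin = InvalidTransactionId;
	uint32_t	next_snapshot = 0;

	assert(nsnapshots > 0);
	limit_xmin = TransactionIdNewer(limit_xmin, snapshot_xmins[next_snapshot++]);

	for (uint32_t page = 1; page <= npages; page++)
	{
		if (refresh_pages > 0 && page % refresh_pages == 0 &&
			next_snapshot < nsnapshots)
			limit_xmin = TransactionIdNewer(limit_xmin,
											snapshot_xmins[next_snapshot++]);
	}
	return limit_xmin;
}
```

In this model the limit only moves forward: a later snapshot with an older xmin does not pull it back, and an XID just past wraparound still counts as newer than one just before it.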




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-23 22:08                                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-28 19:17                                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-31 22:11                                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-04-06 18:21                                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-04-07 01:42                                                                           ` Josh Kupershmidt <[email protected]>
  2026-04-07 23:19                                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Josh Kupershmidt @ 2026-04-07 01:42 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hi,

I was interested in this feature and took an initial look through the
patch. Apologies in advance if I'm missing context from the thread's
history; I'm starting fresh here.

A few initial notes from looking at the v34 patches:

Usability and docs:
 * We now leave behind two invalid indexes that the user has to figure out
how to drop in case of an error, which seems like it could be confusing.
Could some better mechanism (error handler, background worker) try to
perform this cleanup automatically? If not, we should at least tell the
user clearly in the error message that both invalid indexes are left
behind (i.e. "idx" and "idx_ccaux" in the example).
 * The docs are inconsistent or confusing about whether one or two indexes
are left behind in case of error, e.g. "command will fail but leave
behind *an* invalid index and its associated auxiliary index" somewhat
implies that there is only one invalid index and that the auxiliary index
is somehow valid.
 * Similarly, when the docs say e.g. "drop the index", it's not necessarily
clear which index is meant when there are two invalid indexes left behind
that the user will see with `\d`.
 * It would be nice to guard against users trying arbitrary CREATE INDEX
... USING stir(...) with a clear error.

A few behavior notes and questions:
 * One of the test cases (line 2478 of patch 0004) does `DELETE FROM
concur_reindex_tab4 WHERE c1 = 1;`, but the table `concur_reindex_tab4`
looks like it has been dropped a few lines above that?
 * The StirPageGetFreeSpace macro from patch 0002 reads
`StirPageGetMaxOffset(page)`, which seems like it could cause an unsafe read
of opaque->maxoff if used on the metapage.
 * A comment explains "No predicate evaluation is needed here", i.e. we
are skipping predicate evaluation in the validation scan step, assuming
that the auxiliary index contains only qualifying TIDs. Is this really
bulletproof for e.g. partial indexes, where a tuple may no longer satisfy
the predicate at the time of the validation scan due to conflicting HOT
updates?

Thanks
Josh

On Mon, Apr 6, 2026 at 2:22 PM Mihail Nikalayeu <[email protected]>
wrote:

> Rebased once again.
>



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2026-04-07 23:19                                                                             ` Mihail Nikalayeu <[email protected]>
  2026-04-11 16:56                                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-04-13 01:05                                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Josh Kupershmidt <[email protected]>
  0 siblings, 2 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-04-07 23:19 UTC (permalink / raw)
  To: Josh Kupershmidt <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Josh!

Your review looks a bit LLM-generated, but anyway, thanks for the review! :)
Especially since at least one point seems to be valid.

> We're leaving behind two invalid indexes now that the user has to figure
> out how to drop in case of an error - that seems like it could be confusing for the user.
> Could we have some better way (error handler, background worker) try to perform this cleanup automatically?
> If not, we should at least tell the user clearly in the error message that both
> invalid indexes are left behind (i.e. "idx" and "idx_ccaux" in the example)

Commit 0005 adds automatic dropping of auxiliary indexes when the
original index is reindexed or dropped. Also, the documentation now
mentions the ccaux index (similar to ccnew).

> Docs are inconsistent or confusing about whether there's one or two indexes left behind in case of error
> - e.g. "command will fail but leave behind *an* invalid index and its associated auxiliary index"
> somewhat implying there is only one invalid index, and somehow the auxiliary index is valid?

The auxiliary index is never marked as valid; I'm not sure we need to
highlight that here. Or do you have an idea for how to rephrase it?

> Similarly, when the doc mentions e.g. "drop the index" - it's not necessarily clear which index
> we're talking about when there are two invalid indexes left behind that the user will see with `\d`

One commit says: "method in such cases is to drop these indexes
and try again to perform".
After 0005 it says: "The auxiliary index (suffixed with
<literal>_ccaux</literal>) will be automatically dropped when the main
index is dropped".
This seems clear to me, but feel free to propose your own wording.

>  * It would be nice to guard against users trying arbitrary CREATE INDEX ... USING stir(...) with a clear error

It will fail with "Building STIR indexes is not supported".

> One of the testcases (line 2478 of patch 0004) does `DELETE FROM concur_reindex_tab4 WHERE c1 = 1;`
> but the table `concur_reindex_tab4` looks like it has been dropped a few lines above that?

Hm, yep, I'll fix it.

> The StirPageGetFreeSpace macro from patch 0002 reads `StirPageGetMaxOffset(page)`
> which seems like it could cause an unsafe read of opaque->maxoff if used on the metapage

But it was never used for the metapage.

> A comment explains "No predicate evaluation is needed here" , i.e. we are skipping predicate
> evaluation in the validation scan step, assuming that the
> auxiliary index contains only qualifying TIDs. Is this really bulletproof for e.g. partial indexes which may
> no longer satisfy the predicate at the time of the validation scan due to conflicting HOT updates?

Conflicting HOT updates are not possible because the catalog contains
the new index definition from the start of the process.
Or do you mean a different scenario?

Best regards,
Mikhail.






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2026-04-11 16:56                                                                               ` Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2026-04-11 16:56 UTC (permalink / raw)
  To: Josh Kupershmidt <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hi!

Rebased, including fixes related to Josh's review.

Thanks!


Attachments:

  [application/octet-stream] v35-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch (97.5K, 2-v35-0004-Use-auxiliary-indexes-for-concurrent-index-opera.patch)
  download | inline diff:
From 707d09d1d73e0e1e770f4793db27d6a4370b3041 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 15:03:10 +0100
Subject: [PATCH v35 4/7] Use auxiliary indexes for concurrent index operations

Replace the second full table scan in concurrent index builds with an auxiliary index approach:
- create a STIR auxiliary index with the same predicate (if any) as the main index
- use it to track tuples inserted during the first phase
- merge the auxiliary index with the main index during validation, so that the new index catches up with any tuples missed during the first phase
- automatically drop the auxiliary index when the main index is ready

To merge the main and auxiliary indexes:
- index_bulk_delete is called for both, and the TIDs of each are put into a tuplesort
- both tuplesorts are sorted
- both tuplesorts are scanned with two pointers, looking for TIDs present in the auxiliary index but absent from the main one
- all such TIDs are put into a tuplestore
- all TIDs in the tuplestore are fetched using a read stream; heapam_index_validate_scan_read_stream_next uses the tuplestore to provide the next page to prefetch
- if a fetched tuple is alive, it is inserted into the main index
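
The two-pointer step above can be sketched in isolation. This is only an illustration, assuming the TIDs are already encoded as sortable 64-bit integers held in plain sorted arrays (the patch itself reads them from tuplesorts); tid_difference and its parameter names are hypothetical and do not appear in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: copy into out[] every value
 * that is present in the sorted array aux[] but absent from the sorted
 * array main_tids[].  Both inputs must be sorted in ascending order, and
 * out[] must have room for naux items.  Returns the number of values written.
 */
static size_t
tid_difference(const int64_t *main_tids, size_t nmain,
               const int64_t *aux, size_t naux,
               int64_t *out)
{
    size_t      i = 0;          /* cursor into main_tids */
    size_t      n = 0;          /* number of values written to out */

    assert(out != NULL || naux == 0);

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux[j]. */
        while (i < nmain && main_tids[i] < aux[j])
            i++;

        /* Main is exhausted or already past aux[j]: aux[j] is missing. */
        if (i == nmain || main_tids[i] > aux[j])
            out[n++] = aux[j];
    }

    return n;
}
```

Because both inputs are sorted, each cursor only moves forward, so the difference is computed in a single pass over the two sets.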

This eliminates the need for a second full table scan during validation, improving performance, especially for large tables. Affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY operations.
---
 doc/src/sgml/monitoring.sgml               |  26 +-
 doc/src/sgml/ref/create_index.sgml         |  34 +-
 doc/src/sgml/ref/reindex.sgml              |  40 +-
 src/backend/access/heap/README.HOT         |  13 +-
 src/backend/access/heap/heapam_handler.c   | 557 ++++++++++++++-------
 src/backend/catalog/index.c                | 322 ++++++++++--
 src/backend/catalog/system_views.sql       |  17 +-
 src/backend/commands/indexcmds.c           | 345 +++++++++++--
 src/backend/nodes/makefuncs.c              |   4 +-
 src/backend/utils/misc/guc_parameters.dat  |   9 +
 src/include/access/tableam.h               |  12 +-
 src/include/catalog/index.h                |   9 +-
 src/include/commands/progress.h            |  13 +-
 src/include/miscadmin.h                    |   1 +
 src/include/nodes/makefuncs.h              |   3 +-
 src/test/regress/expected/create_index.out |  35 ++
 src/test/regress/expected/indexing.out     |   3 +-
 src/test/regress/expected/rules.out        |  17 +-
 src/test/regress/sql/create_index.sql      |  21 +
 19 files changed, 1147 insertions(+), 334 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 08d5b824552..1f2cd0d6f7e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6971,6 +6971,18 @@ FROM pg_stat_get_backend_idset() AS backendid;
        information for this phase.
       </entry>
      </row>
+     <row>
+      <entry><literal>waiting for writers to use auxiliary index</literal></entry>
+      <entry>
+       <command>CREATE INDEX CONCURRENTLY</command> or <command>REINDEX CONCURRENTLY</command> is waiting for transactions
+       with write locks that can potentially see the table to finish, to ensure that all subsequent transactions
+       insert new tuples into the auxiliary index.
+       This phase is skipped when not in concurrent mode.
+       Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
+       and <structname>current_locker_pid</structname> contain the progress
+       information for this phase.
+      </entry>
+     </row>
      <row>
       <entry><literal>building index</literal></entry>
       <entry>
@@ -7011,13 +7023,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
       </entry>
      </row>
      <row>
-      <entry><literal>index validation: scanning table</literal></entry>
+      <entry><literal>index validation: merging indexes</literal></entry>
       <entry>
-       <command>CREATE INDEX CONCURRENTLY</command> is scanning the table
-       to validate the index tuples collected in the previous two phases.
+       <command>CREATE INDEX CONCURRENTLY</command> is merging the contents of the auxiliary index into the target index.
        This phase is skipped when not in concurrent mode.
-       Columns <structname>blocks_total</structname> (set to the total size of the table)
-       and <structname>blocks_done</structname> contain the progress information for this phase.
+       Columns <structname>tuples_total</structname> (set to the number of tuples to be merged)
+       and <structname>tuples_done</structname> contain the progress information for this phase.
       </entry>
      </row>
      <row>
@@ -7034,8 +7045,9 @@ FROM pg_stat_get_backend_idset() AS backendid;
      <row>
       <entry><literal>waiting for readers before marking dead</literal></entry>
       <entry>
-       <command>REINDEX CONCURRENTLY</command> is waiting for transactions
-       with read locks on the table to finish, before marking the old index dead.
+       <command>CREATE INDEX CONCURRENTLY</command> is waiting for transactions
+       with read locks on the table to finish, before marking the auxiliary index as dead.
+       <command>REINDEX CONCURRENTLY</command> is also waiting before marking the old index as dead.
        This phase is skipped when not in concurrent mode.
        Columns <structname>lockers_total</structname>, <structname>lockers_done</structname>
        and <structname>current_locker_pid</structname> contain the progress
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index bb7505d171b..901c6cf22bc 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -620,10 +620,10 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
     out writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>CREATE INDEX</command>.
     When this option is used,
-    <productname>PostgreSQL</productname> must perform two scans of the table, and in
-    addition it must wait for all existing transactions that could potentially
-    modify or use the index to terminate.  Thus
-    this method requires more total work than a standard index build and takes
+    <productname>PostgreSQL</productname> must perform a table scan followed by
+    a validation phase, and in addition it must wait for all existing transactions
+    that could potentially modify or use the index to terminate.  Thus
+    this method requires more total work than a standard index build and may take
     significantly longer to complete.  However, since it allows normal
     operations to continue while the index is built, this method is useful for
     adding new indexes in a production environment.  Of course, the extra CPU
@@ -631,14 +631,14 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    </para>
 
    <para>
-    In a concurrent index build, the index is actually entered as an
-    <quote>invalid</quote> index into
-    the system catalogs in one transaction, then two table scans occur in
-    two more transactions.  Before each table scan, the index build must
+    In a concurrent index build, the main and auxiliary indexes are actually
+    entered as <quote>invalid</quote> indexes into
+    the system catalogs in one transaction, then two phases occur in
+    multiple transactions.  Before each phase, the index build must
     wait for existing transactions that have modified the table to terminate.
-    After the second scan, the index build must wait for any transactions
+    After the second phase, the index build must wait for any transactions
     that have a snapshot (see <xref linkend="mvcc"/>) predating the second
-    scan to terminate, including transactions used by any phase of concurrent
+    phase to terminate, including transactions used by any phase of concurrent
     index builds on other tables, if the indexes involved are partial or have
     columns that are not simple column references.
     Then finally the index can be marked <quote>valid</quote> and ready for use,
@@ -651,10 +651,11 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
    <para>
     If a problem arises while scanning the table, such as a deadlock or a
     uniqueness violation in a unique index, the <command>CREATE INDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> index. This index
-    will be ignored for querying purposes because it might be incomplete;
-    however it will still consume update overhead. The <application>psql</application>
-    <command>\d</command> command will report such an index as <literal>INVALID</literal>:
+    command will fail but leave behind two <quote>invalid</quote> indexes:
+    the main index and its associated auxiliary index. These indexes
+    will be ignored for querying purposes because they might be incomplete;
+    however they will still consume update overhead. The <application>psql</application>
+    <command>\d</command> command will report such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -664,11 +665,12 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
+    method in such cases is to drop these indexes and try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
     to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
    </para>
 
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 185cd75ca30..56c9a0fe1f3 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -368,9 +368,8 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <productname>PostgreSQL</productname> supports rebuilding indexes with minimum locking
     of writes.  This method is invoked by specifying the
     <literal>CONCURRENTLY</literal> option of <command>REINDEX</command>. When this option
-    is used, <productname>PostgreSQL</productname> must perform two scans of the table
-    for each index that needs to be rebuilt and wait for termination of
-    all existing transactions that could potentially use the index.
+    is used, <productname>PostgreSQL</productname> must perform several steps to ensure data
+    consistency while allowing normal operations to continue.
     This method requires more total work than a standard index
     rebuild and takes significantly longer to complete as it needs to wait
     for unfinished transactions that might modify the index. However, since
@@ -388,7 +387,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
     <orderedlist>
      <listitem>
       <para>
-       A new transient index definition is added to the catalog
+       A new transient index definition and an auxiliary index are added to the catalog
        <literal>pg_index</literal>.  This definition will be used to replace
        the old index.  A <literal>SHARE UPDATE EXCLUSIVE</literal> lock at
        session level is taken on the indexes being reindexed as well as their
@@ -398,7 +397,15 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       A first pass to build the index is done for each new index.  Once the
+       The auxiliary index is marked as <quote>ready for inserts</quote>, making
+       it visible to other sessions. This index efficiently tracks all new
+       tuples during the reindex process.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The new main index is built by scanning the table.  Once the
        index is built, its flag <literal>pg_index.indisready</literal> is
        switched to <quote>true</quote> to make it ready for inserts, making it
        visible to other sessions once the transaction that performed the build
@@ -409,9 +416,9 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       Then a second pass is performed to add tuples that were added while the
-       first pass was running.  This step is also done in a separate
-       transaction for each index.
+       A validation phase merges any missing entries from the auxiliary index
+       into the main index, ensuring all concurrent changes are captured.
+       This step is also done in a separate transaction for each index.
       </para>
      </listitem>
 
@@ -428,7 +435,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes have <literal>pg_index.indisready</literal> switched to
+       The old and auxiliary indexes have <literal>pg_index.indisready</literal> switched to
        <quote>false</quote> to prevent any new tuple insertions, after waiting
        for running queries that might reference the old index to complete.
       </para>
@@ -436,7 +443,7 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
 
      <listitem>
       <para>
-       The old indexes are dropped.  The <literal>SHARE UPDATE
+       The old and auxiliary indexes are dropped.  The <literal>SHARE UPDATE
        EXCLUSIVE</literal> session locks for the indexes and the table are
        released.
       </para>
@@ -447,11 +454,11 @@ REINDEX [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] { DA
    <para>
     If a problem arises while rebuilding the indexes, such as a
     uniqueness violation in a unique index, the <command>REINDEX</command>
-    command will fail but leave behind an <quote>invalid</quote> new index in addition to
-    the pre-existing one. This index will be ignored for querying purposes
-    because it might be incomplete; however it will still consume update
+    command will fail but leave behind an <quote>invalid</quote> new index and an <quote>invalid</quote> auxiliary index in addition to
+    the pre-existing one. These indexes will be ignored for querying purposes
+    because they might be incomplete; however they will still consume update
     overhead. The <application>psql</application> <command>\d</command> command will report
-    such an index as <literal>INVALID</literal>:
+    such indexes as <literal>INVALID</literal>:
 
 <programlisting>
 postgres=# \d tab
@@ -462,12 +469,13 @@ postgres=# \d tab
 Indexes:
     "idx" btree (col)
     "idx_ccnew" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>
 
     If the index marked <literal>INVALID</literal> is suffixed
-    <literal>_ccnew</literal>, then it corresponds to the transient
+    <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop it using <literal>DROP INDEX</literal>,
+    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
     then attempt <command>REINDEX CONCURRENTLY</command> again.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..b1c797517ee 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,10 +399,10 @@ entry at the root of the HOT-update chain but we use the key value from the
 live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any that are visible to the reference snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 20d3b46e062..8cbc4855078 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -42,11 +42,16 @@
 #include "storage/lmgr.h"
 #include "storage/lock.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
+#include "utils/tuplestore.h"
+
+/* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
+int			debug_cic_validate_store_mem_pct = 10;
 
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
@@ -1714,242 +1719,422 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
+/*
+ * Calculate set difference (relative complement) of main and aux
+ * sets.
+ *
+ * All records which are present in auxiliary tuplesort but not in
+ * main are added to the store.
+ *
+ * In set theory notation: store = aux \ main.
+ *
+ * Returns the number of items added to the store.
+ */
+static int64
+heapam_index_validate_tuplesort_difference(Tuplesortstate *main,
+										   Tuplesortstate *aux,
+										   Tuplestorestate *store)
+{
+	int64		num = 0;
+	/* state variables for the merge */
+	ItemPointer	indexcursor = NULL,
+					auxindexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+
+	/*
+	 * Main loop: we step through the auxiliary sort (aux), which holds TIDs
+	 * that must be compared to those from the "main" sort (main), i.e. the
+	 * caller's auxState->tuplesort and state->tuplesort respectively.
+	 */
+	while (!auxtuplesort_empty)
+	{
+		Datum		ts_val;
+		bool		ts_isnull;
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(aux, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
+		{
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
+		}
+		else
+		{
+			auxindexcursor = NULL;
+		}
+
+		/*
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
+		{
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
+			{
+				/*
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
+				 */
+				tuplesort_empty = !tuplesort_getdatum(main, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
+			}
+
+			/*
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
+			 */
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
+			{
+				tuplestore_putdatum(store, Int64GetDatum(itemptr_encode(auxindexcursor)));
+				num++;
+			}
+		}
+	}
+
+	return num;
+}
+
+typedef struct ValidateIndexScanState
+{
+	Tuplestorestate		*store;
+	BlockNumber			prev_block_number;
+	OffsetNumber		prev_offset_number;
+} ValidateIndexScanState;
+
+/*
+ * This is a ReadStreamBlockNumberCB implementation that works as follows:
+ *
+ * 1) It iterates over a sorted tuplestore, where each element is an encoded
+ *    ItemPointer
+ *
+ * 2) It returns the current BlockNumber and collects all OffsetNumbers
+ *    for that block in per_buffer_data.
+ *
+ * 3) Once the code encounters a new BlockNumber, it stops reading more
+ *    offsets and saves the OffsetNumber of the new block for the next call.
+ *
+ * 4) The list of offsets for a block is always terminated with InvalidOffsetNumber.
+ *
+ * This function is intended to be repeatedly called, each time returning
+ * the next block and its corresponding set of offsets.
+ */
+static BlockNumber
+heapam_index_validate_scan_read_stream_next(
+								  ReadStream *stream,
+								  void *void_callback_private_data,
+								  void *void_per_buffer_data
+								  )
+{
+	bool should_free;
+	Datum datum;
+	BlockNumber result = InvalidBlockNumber;
+	int i = 0;
+
+	/*
+	 * Retrieve the specialized callback state and the output buffer.
+	 * callback_private_data keeps track of the previous block and offset
+	 * from a prior invocation, if any.
+	 */
+	ValidateIndexScanState *callback_private_data = void_callback_private_data;
+	OffsetNumber *per_buffer_data = void_per_buffer_data;
+
+	/*
+	 * If there is a "leftover" offset number from the previous invocation,
+	 * it means we had switched to a new block in the middle of the last call.
+	 * We place that leftover offset number into the buffer first.
+	 */
+	if (callback_private_data->prev_offset_number != InvalidOffsetNumber)
+	{
+		Assert(callback_private_data->prev_block_number != InvalidBlockNumber);
+		/*
+		 * 'result' is the block number to return. We set it to the block
+		 * from the previous leftover offset.
+		 */
+		result = callback_private_data->prev_block_number;
+		/* Place leftover offset number in the output buffer. */
+		per_buffer_data[i++] = callback_private_data->prev_offset_number;
+		/*
+		 * Clear the leftover offset number so it won't be reused unless
+		 * we encounter another block change.
+		 */
+		callback_private_data->prev_offset_number = InvalidOffsetNumber;
+	}
+
+	/*
+	 * Read from the tuplestore until we either run out of tuples or we
+	 * encounter a block change. For each tuple:
+	 *
+	 *   1) Decode its block/offset from the Datum.
+	 *   2) If it's the first time in this call (prev_block_number == InvalidBlockNumber),
+	 *      initialize prev_block_number.
+	 *   3) If the block number matches the current block, collect the offset.
+	 *   4) If the block number differs, save that offset as leftover and break
+	 *      so that the next call can handle the new block.
+	 */
+	while (tuplestore_getdatum(callback_private_data->store, true, &should_free, &datum))
+	{
+		BlockNumber next_block_number;
+		ItemPointerData next_data;
+
+		/* Decode the datum into an ItemPointer (block + offset). */
+		itemptr_decode(&next_data, DatumGetInt64(datum));
+		next_block_number = ItemPointerGetBlockNumber(&next_data);
+
+		/*
+		 * If we haven't set a block number yet this round, initialize it
+		 * using the first tuple we read.
+		 */
+		if (callback_private_data->prev_block_number == InvalidBlockNumber)
+			callback_private_data->prev_block_number = next_block_number;
+
+		/*
+		 * Always set the result to be the "current" block number
+		 * we are filling offsets for.
+		 */
+		result = callback_private_data->prev_block_number;
+
+		/*
+		 * If this tuple is from the same block, just store its offset
+		 * in our per_buffer_data array.
+		 */
+		if (next_block_number == callback_private_data->prev_block_number)
+		{
+			per_buffer_data[i++] = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+		}
+		else
+		{
+			/*
+			 * If the block just changed, store the offset of the new block
+			 * as leftover for the next invocation and break out.
+			 */
+			callback_private_data->prev_block_number = next_block_number;
+			callback_private_data->prev_offset_number = ItemPointerGetOffsetNumber(&next_data);
+
+			/* Free the datum if needed. */
+			if (should_free)
+				pfree(DatumGetPointer(datum));
+
+			/* Break to let the next call handle the new block. */
+			break;
+		}
+	}
+
+	/*
+	 * Terminate the list of offsets for this block with an InvalidOffsetNumber.
+	 */
+	per_buffer_data[i] = InvalidOffsetNumber;
+	return result;
+}
+
 static void
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
 						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
-
-	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
+	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
+
+	int64			num_to_check;
+	Tuplestorestate *tuples_for_check;
+	ValidateIndexScanState callback_private_data;
+
+	Buffer buf;
+	OffsetNumber *tuples;
+	ReadStream *read_stream;
+
+	/* Use a percentage of maintenance_work_mem for tuple store. */
+	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
+
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
+	num_to_check = heapam_index_validate_tuplesort_difference(state->tuplesort,
+														 auxState->tuplesort,
+														 tuples_for_check);
+
+	/* It is our responsibility to end the tuplesorts as soon as possible */
+	tuplesort_end(state->tuplesort);
+	tuplesort_end(auxState->tuplesort);
+
+	state->tuplesort = auxState->tuplesort = NULL;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	callback_private_data.prev_block_number = InvalidBlockNumber;
+	callback_private_data.store = tuples_for_check;
+	callback_private_data.prev_offset_number = InvalidOffsetNumber;
 
-	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
-	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
-	hscan = (HeapScanDesc) scan;
+	read_stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE | READ_STREAM_USE_BATCHING,
+														 bstrategy,
+														 heapRelation, MAIN_FORKNUM,
+														 heapam_index_validate_scan_read_stream_next,
+														 &callback_private_data,
+														 (MaxHeapTuplesPerPage + 1) * sizeof(OffsetNumber));
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL, num_to_check);
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE, 0);
 
-	/*
-	 * Scan all tuples matching the snapshot.
-	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while ((buf = read_stream_next_buffer(read_stream, (void **) &tuples)) != InvalidBuffer)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
+		HeapTupleData	heap_tuple_data[MaxHeapTuplesPerPage];
+		int i;
+		OffsetNumber off;
+		BlockNumber block_number;
 
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
+		LockBuffer(buf, BUFFER_LOCK_SHARE);
+		block_number = BufferGetBlockNumber(buf);
 
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
+			ItemPointerData tid;
+			bool		all_dead, found;
+			ItemPointerSet(&tid, block_number, off);
+
+			found = heap_hot_search_buffer(&tid, heapRelation, buf, snapshot,
+										   &heap_tuple_data[i], &all_dead, true);
+			if (!found)
+				ItemPointerSetInvalid(&heap_tuple_data[i].t_self);
+			i++;
+			state->htups += 1;
 		}
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
+		 * It is safe to access tuple data after releasing the buffer lock
+		 * because the buffer pin is still held, and the only operation that
+		 * could physically move tuple data on the page is
+		 * PageRepairFragmentation via heap_page_prune.  VACUUM conflicts with
+		 * CIC (both take ShareUpdateExclusiveLock), and opportunistic pruning
+		 * from concurrent DML cannot affect root tuples we are referencing.
 		 */
-		if (hscan->rs_cblock != root_blkno)
-		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
-		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
-		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
-		}
-
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
+		 * No predicate evaluation is needed here: the auxiliary STIR index
+		 * only contains TIDs for tuples that already satisfied the partial
+		 * index predicate at DML time (checked in ExecInsertIndexTuples).
 		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		i = 0;
+		while ((off = tuples[i]) != InvalidOffsetNumber)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			if (ItemPointerIsValid(&heap_tuple_data[i].t_self))
 			{
+				ItemPointerData root_tid;
+				ItemPointerSet(&root_tid, block_number, off);
+
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				ExecStoreBufferHeapTuple(&heap_tuple_data[i], slot, buf);
+
+				/* Compute the key values and null flags for this tuple. */
+				FormIndexDatum(indexInfo,
+							   slot,
+							   estate,
+							   values,
+							   isnull);
+
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Insert the tuple into the target index.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
+				index_insert(indexRelation,
+							 values,
+							 isnull,
+							 &root_tid, /* insert root tuple */
+							 heapRelation,
+							 indexInfo->ii_Unique ?
+							 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+							 false,
+							 indexInfo);
+
+				state->tups_inserted += 1;
 			}
 
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
-			}
+			pgstat_progress_incr_param(PROGRESS_CREATEIDX_TUPLES_DONE, 1);
+			i++;
 		}
 
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
-
-			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
-			 */
-			if (predicate != NULL)
-			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
-
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
-
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
-
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
-		}
+		ReleaseBuffer(buf);
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	read_stream_end(read_stream);
+	tuplestore_end(tuples_for_check);
+
+	FreeAccessStrategy(bstrategy);
+
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index cc067e58d36..b1417ec05c6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -715,6 +715,8 @@ UpdateIndexRelation(Oid indexoid,
  *			already exists.
  *		INDEX_CREATE_PARTITIONED:
  *			create a partitioned index (table must be partitioned)
+ *		INDEX_CREATE_AUXILIARY:
+ *			mark index as auxiliary index
  *		INDEX_CREATE_SUPPRESS_PROGRESS:
  *			don't report progress during the index build.
  *
@@ -723,6 +725,9 @@ UpdateIndexRelation(Oid indexoid,
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index.  In most cases
+ *		it should be equal to the persistence level of the table;
+ *		auxiliary indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -763,6 +768,7 @@ index_create(Relation heapRelation,
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	bool		progress = (flags & INDEX_CREATE_SUPPRESS_PROGRESS) == 0;
 	char		relkind;
 	TransactionId relfrozenxid;
@@ -789,7 +795,10 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
+	if (auxiliary)
+		relpersistence = RELPERSISTENCE_UNLOGGED; /* aux indexes are always unlogged */
+	else
+		relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -797,6 +806,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1402,7 +1416,8 @@ index_create_copy(Relation heapRelation, uint16 flags,
 							!concurrently,	/* isready */
 							concurrently,	/* concurrent */
 							indexRelation->rd_indam->amsummarizing,
-							oldInfo->ii_WithoutOverlaps);
+							oldInfo->ii_WithoutOverlaps,
+							false);
 
 	/* fetch exclusion constraint info if any */
 	if (indexRelation->rd_index->indisexclusion)
@@ -1422,13 +1437,16 @@ index_create_copy(Relation heapRelation, uint16 flags,
 	 * index information.  All this information will be used for the index
 	 * creation.
 	 */
-	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
 	{
 		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
-		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
 
-		indexColNames = lappend(indexColNames, NameStr(att->attname));
-		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
 	}
 
 	/* Extract opclass options for each attribute */
@@ -1490,6 +1508,157 @@ index_create_copy(Relation heapRelation, uint16 flags,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Create concurrently an auxiliary index based on the definition of the one
+ * provided by caller.  The index is inserted into catalogs and needs to be
+ * built later on.  Used by concurrent index build and reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Fetch the pg_index tuple of the main index from the syscache */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false,	/* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,	/* concurrent */
+							false,	/* aux indexes are not summarizing */
+							false,	/* aux indexes are not without overlaps */
+							true	/* auxiliary */);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+
+		for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+		{
+			Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+			indexColNames = lappend(indexColNames, NameStr(att->attname));
+			newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+		}
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL);
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -2470,7 +2639,8 @@ BuildIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2530,7 +2700,8 @@ BuildDummyIndexInfo(Relation index)
 					   indexStruct->indisready,
 					   false,
 					   index->rd_indam->amsummarizing,
-					   indexStruct->indisexclusion && indexStruct->indisunique);
+					   indexStruct->indisexclusion && indexStruct->indisunique,
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3309,12 +3480,21 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * After that, we build the auxiliary index.  This is a fast operation that
+ * does not scan the table, so the result is an empty STIR index.  We commit
+ * the transaction and again wait for all transactions that could have been
+ * modifying the table to terminate.  From that moment on, all new tuples are
+ * inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3324,14 +3504,17 @@ IndexCheckExclusion(Relation heapRelation,
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
- * does not contain any tuples added to the table while we built the index.
+ * does not contain any tuples added to the table while we built the index
+ * (but these tuples are contained in the auxiliary index).
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
- * commit the second transaction and start a third.  Again we wait for all
+ * commit the third transaction and start a fourth.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it.  At the same time we clear "indisready"
+ * for the auxiliary index, since it no longer needs to be updated.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3339,12 +3522,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" of
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must run in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3362,22 +3547,26 @@ IndexCheckExclusion(Relation heapRelation,
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
  * transactions will be able to use it for queries.
  *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * Also, steps to concurrently drop the auxiliary index are performed.
  */
 void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs
+	 * and 10% for the auxiliary index TIDs.
+	 *
+	 * The remaining 10% will be used for the tuplestore later. */
+	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
+	int			aux_work_mem_part = maintenance_work_mem / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3410,6 +3599,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
@@ -3434,15 +3624,49 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   aux_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+	/* If aux index is empty, merge may be skipped */
+	if (auxState.itups == 0)
+	{
+		tuplesort_end(auxState.tuplesort);
+		auxState.tuplesort = NULL;
+
+		/* Roll back any GUC changes executed by index functions */
+		AtEOXact_GUC(false, save_nestlevel);
+
+		/* Restore userid and security context */
+		SetUserIdAndSecContext(save_userid, save_sec_context);
+
+		/* Close rels, but keep locks */
+		index_close(auxIndexRelation, NoLock);
+		index_close(indexRelation, NoLock);
+		table_close(heapRelation, NoLock);
+
+		return;
+	}
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											(int) main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3465,27 +3689,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
 	table_index_validate_scan(heapRelation,
 							  indexRelation,
 							  indexInfo,
 							  snapshot,
-							  &state);
+							  &state,
+							  &auxState);
 
-	/* Done with tuplesort object */
-	tuplesort_end(state.tuplesort);
+	/* Both tuplesorts are closed by table_index_validate_scan */
+	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3494,6 +3721,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
 }
@@ -3554,6 +3782,12 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
@@ -3825,6 +4059,13 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		indexInfo->ii_ExclusionStrats = NULL;
 	}
 
+	/* Auxiliary indexes are not allowed to be rebuilt */
+	if (indexInfo->ii_Auxiliary)
+		ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("reindex of auxiliary index \"%s\" not supported",
+					RelationGetRelationName(iRel))));
+
 	/* Suppress use of the target index while rebuilding it */
 	SetReindexProcessing(heapId, indexId);
 
@@ -4067,6 +4308,7 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
+		Oid			indexAm = get_rel_relam(indexOid);
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4092,6 +4334,18 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
+		if (indexAm == STIR_AM_OID)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+							get_namespace_name(indexNamespaceId),
+							get_rel_name(indexOid))));
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			continue;
+		}
+
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 73a1c1c4670..23d292aaced 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1409,16 +1409,17 @@ CREATE VIEW pg_stat_progress_create_index AS
                       END AS command,
         CASE S.param10 WHEN 0 THEN 'initializing'
                        WHEN 1 THEN 'waiting for writers before build'
-                       WHEN 2 THEN 'building index' ||
+                       WHEN 2 THEN 'waiting for writers to use auxiliary index'
+                       WHEN 3 THEN 'building index' ||
                            COALESCE((': ' || pg_indexam_progress_phasename(S.param9::oid, S.param11)),
                                     '')
-                       WHEN 3 THEN 'waiting for writers before validation'
-                       WHEN 4 THEN 'index validation: scanning index'
-                       WHEN 5 THEN 'index validation: sorting tuples'
-                       WHEN 6 THEN 'index validation: scanning table'
-                       WHEN 7 THEN 'waiting for old snapshots'
-                       WHEN 8 THEN 'waiting for readers before marking dead'
-                       WHEN 9 THEN 'waiting for readers before dropping'
+                       WHEN 4 THEN 'waiting for writers before validation'
+                       WHEN 5 THEN 'index validation: scanning index'
+                       WHEN 6 THEN 'index validation: sorting tuples'
+                       WHEN 7 THEN 'index validation: merging indexes'
+                       WHEN 8 THEN 'waiting for old snapshots'
+                       WHEN 9 THEN 'waiting for readers before marking dead'
+                       WHEN 10 THEN 'waiting for readers before dropping'
                        END as phase,
         S.param4 AS lockers_total,
         S.param5 AS lockers_done,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 9ab74c8df0a..2d7b6b7eb8b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -183,6 +183,7 @@ CheckIndexCompatible(Oid oldId,
 					 bool isWithoutOverlaps)
 {
 	bool		isconstraint;
+	bool		isauxiliary;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
 	Oid		   *opclassIds;
@@ -233,6 +234,7 @@ CheckIndexCompatible(Oid oldId,
 
 	amcanorder = amRoutine->amcanorder;
 	amsummarizing = amRoutine->amsummarizing;
+	isauxiliary = accessMethodId == STIR_AM_OID;
 
 	/*
 	 * Compute the operator classes, collations, and exclusion operators for
@@ -244,7 +246,8 @@ CheckIndexCompatible(Oid oldId,
 	 */
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
-							  false, false, amsummarizing, isWithoutOverlaps);
+							  false, false, amsummarizing,
+							  isWithoutOverlaps, isauxiliary);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -557,6 +560,7 @@ DefineIndex(ParseState *pstate,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -566,6 +570,7 @@ DefineIndex(ParseState *pstate,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -587,6 +592,7 @@ DefineIndex(ParseState *pstate,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
@@ -837,6 +843,15 @@ DefineIndex(ParseState *pstate,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -931,7 +946,8 @@ DefineIndex(ParseState *pstate,
 							  !concurrent,
 							  concurrent,
 							  amissummarizing,
-							  stmt->iswithoutoverlaps);
+							  stmt->iswithoutoverlaps,
+							  false);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -1603,6 +1619,16 @@ DefineIndex(ParseState *pstate,
 		return address;
 	}
 
+	/*
+	 * For a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1631,11 +1657,11 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid,
+	 * so that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1645,7 +1671,7 @@ DefineIndex(ParseState *pstate,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1684,7 +1710,7 @@ DefineIndex(ParseState *pstate,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1696,14 +1722,44 @@ DefineIndex(ParseState *pstate,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts".  The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
+	 * Now call build on the auxiliary index.  It is created empty, without
+	 * any actual heap scan, but is marked as "ready-for-inserts".  Its purpose
+	 * is to accumulate new tuples while the main index is being built.
+	 */
+
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	PushActiveSnapshot(GetTransactionSnapshot());
+
+	index_concurrently_build(tableId, auxIndexRelationId);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index.  Now it is time to build the target index.
+	 *
 	 * We now take a new snapshot, and build the index using all tuples that
 	 * are visible in this snapshot.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
@@ -1738,9 +1794,28 @@ DefineIndex(ParseState *pstate,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/*
+	 * The target index is now marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed.  Start the removal process by
+	 * clearing its "ready" flag.
+	 */
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
 	/*
 	 * Now take the "reference snapshot" that will be used by validate_index()
 	 * to filter candidate tuples.  Beware!  There might still be snapshots in
@@ -1758,24 +1833,14 @@ DefineIndex(ParseState *pstate,
 	 */
 	snapshot = RegisterSnapshot(GetTransactionSnapshot());
 	PushActiveSnapshot(snapshot);
-
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
-	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
+	 * Merge contents of auxiliary and target indexes; insert any missing entries.
 	 */
+	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
 	limitXmin = snapshot->xmin;
 
 	PopActiveSnapshot();
 	UnregisterSnapshot(snapshot);
-
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1802,7 +1867,7 @@ DefineIndex(ParseState *pstate,
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -1827,6 +1892,53 @@ DefineIndex(ParseState *pstate,
 	 * to replan; so relcache flush on the index itself was sufficient.)
 	 */
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+	/* Now wait for all transactions to see the auxiliary index as "not-ready-for-inserts" */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_6);
+	/* Now wait for all transactions to ignore the auxiliary index because it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
 
 	/*
 	 * Last thing to do is release the session-level lock on the parent table.
@@ -3598,6 +3710,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3703,8 +3816,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 					Oid			cellOid = lfirst_oid(lc);
 					Relation	indexRelation = index_open(cellOid,
 														   ShareUpdateExclusiveLock);
+					IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-					if (!indexRelation->rd_index->indisvalid)
+
+					if (indexInfo->ii_Auxiliary)
+						ereport(WARNING,(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+									get_namespace_name(get_rel_namespace(cellOid)),
+									get_rel_name(cellOid))));
+					else if (!indexRelation->rd_index->indisvalid)
 						ereport(WARNING,
 								(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 								 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3756,8 +3876,15 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 						Oid			cellOid = lfirst_oid(lc2);
 						Relation	indexRelation = index_open(cellOid,
 															   ShareUpdateExclusiveLock);
+						IndexInfo  *indexInfo = BuildDummyIndexInfo(indexRelation);
 
-						if (!indexRelation->rd_index->indisvalid)
+						if (indexInfo->ii_Auxiliary)
+							ereport(WARNING,
+									(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+									 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
+											get_namespace_name(get_rel_namespace(cellOid)),
+											get_rel_name(cellOid))));
+						else if (!indexRelation->rd_index->indisvalid)
 							ereport(WARNING,
 									(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 									 errmsg("skipping reindex of invalid index \"%s.%s\"",
@@ -3818,6 +3945,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot reindex invalid index on TOAST table")));
 
+				/* Auxiliary indexes are not allowed to be rebuilt */
+				if (get_rel_relam(relationOid) == STIR_AM_OID)
+					ereport(ERROR,
+						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						 errmsg("reindex of auxiliary index \"%s\" not supported",
+								get_rel_name(relationOid))));
+
 				/*
 				 * Check if parent relation can be locked and if it exists,
 				 * this needs to be done at this stage as the list of indexes
@@ -3921,15 +4055,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3980,6 +4117,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3997,11 +4139,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 									   tablespaceid,
 									   concurrentName);
 
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   newIndexId,
+												   tablespaceid,
+												   auxConcurrentName);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4010,6 +4158,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4028,10 +4177,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4112,13 +4265,60 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Set ActiveSnapshot since functions in the indexes may need it */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		/* Build the auxiliary index; this is fast because it starts empty and no heap scan is done. */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4165,6 +4365,41 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
@@ -4172,12 +4407,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4215,7 +4444,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
+		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
 
 		/*
 		 * We can now do away with our active snapshot, we still need to save
@@ -4244,7 +4473,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4335,14 +4564,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex-relation-concurrently-before-set-dead", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4367,6 +4596,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4380,11 +4631,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4404,6 +4655,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 5359dab1176..84f7cf9824e 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps)
+			  bool withoutoverlaps, bool auxiliary)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -850,6 +850,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Concurrent = concurrent;
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
+	n->ii_Auxiliary = auxiliary;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
@@ -875,7 +876,6 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
-	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 83af594d4af..3477866d729 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -640,6 +640,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_store_mem_pct',
+  boot_val => '10',
+  min => '1',
+  max => '90',
+},
+
 { name => 'debug_copy_parse_plan_trees', type => 'bool', context => 'PGC_SUSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Set this to force all parse and plan trees to be passed through copyObject(), to facilitate catching errors and omissions in copyObject().',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index c13f05d39db..da3598663bc 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -743,7 +743,8 @@ typedef struct TableAmRoutine
 										Relation index_rel,
 										IndexInfo *index_info,
 										Snapshot snapshot,
-										ValidateIndexState *state);
+										ValidateIndexState *state,
+										ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1906,19 +1907,24 @@ table_index_build_range_scan(Relation table_rel,
  * table_index_validate_scan - second table scan for concurrent index build
  *
  * See validate_index() for an explanation.
+ *
+ * Note: it is the responsibility of this function to close the sortstates in
+ * both `state` and `auxstate`.
  */
 static inline void
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
 						  Snapshot snapshot,
-						  ValidateIndexState *state)
+						  ValidateIndexState *state,
+						  ValidateIndexState *auxstate)
 {
 	table_rel->rd_tableam->index_validate_scan(table_rel,
 											   index_rel,
 											   index_info,
 											   snapshot,
-											   state);
+											   state,
+											   auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 9aee8226347..3239e5c716f 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -31,6 +31,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -72,6 +73,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
 #define INDEX_CREATE_SUPPRESS_PROGRESS		(1 << 7)
+#define INDEX_CREATE_AUXILIARY				(1 << 8)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -106,6 +108,11 @@ extern Oid	index_create_copy(Relation heapRelation, uint16 flags,
 							  Oid oldIndexId, Oid tablespaceOid,
 							  const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
 									 Oid indexRelationId);
 
@@ -152,7 +159,7 @@ extern void index_build(Relation heapRelation,
 						bool parallel,
 						bool progress);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 2a12920c75f..daac9f4f34e 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -120,14 +120,15 @@
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
 #define PROGRESS_CREATEIDX_PHASE_WAIT_1			1
-#define PROGRESS_CREATEIDX_PHASE_BUILD			2
-#define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
-#define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
+#define PROGRESS_CREATEIDX_PHASE_WAIT_2			2
+#define PROGRESS_CREATEIDX_PHASE_BUILD			3
+#define PROGRESS_CREATEIDX_PHASE_WAIT_3			4
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	5
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8ccdf61246b..8c2b3a9c5e7 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -271,6 +271,7 @@ extern PGDLLIMPORT bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index bf54d39feb0..cd7f1eb0592 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -99,7 +99,8 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								List *expressions, List *predicates,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
-								bool summarizing, bool withoutoverlaps);
+								bool summarizing, bool withoutoverlaps,
+								bool auxiliary);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 55538c4c41e..937c3b48a0f 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1437,6 +1437,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3211,6 +3212,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3223,8 +3225,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -3252,6 +3256,37 @@ Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1)
 
 DROP TABLE concur_reindex_tab4;
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index f50868ca6a6..b34009f868c 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a65a5bf0c4f..9800b9f1440 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2079,14 +2079,15 @@ pg_stat_progress_create_index| SELECT s.pid,
         CASE s.param10
             WHEN 0 THEN 'initializing'::text
             WHEN 1 THEN 'waiting for writers before build'::text
-            WHEN 2 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
-            WHEN 3 THEN 'waiting for writers before validation'::text
-            WHEN 4 THEN 'index validation: scanning index'::text
-            WHEN 5 THEN 'index validation: sorting tuples'::text
-            WHEN 6 THEN 'index validation: scanning table'::text
-            WHEN 7 THEN 'waiting for old snapshots'::text
-            WHEN 8 THEN 'waiting for readers before marking dead'::text
-            WHEN 9 THEN 'waiting for readers before dropping'::text
+            WHEN 2 THEN 'waiting for writers to use auxiliary index'::text
+            WHEN 3 THEN ('building index'::text || COALESCE((': '::text || pg_indexam_progress_phasename((s.param9)::oid, s.param11)), ''::text))
+            WHEN 4 THEN 'waiting for writers before validation'::text
+            WHEN 5 THEN 'index validation: scanning index'::text
+            WHEN 6 THEN 'index validation: sorting tuples'::text
+            WHEN 7 THEN 'index validation: merging indexes'::text
+            WHEN 8 THEN 'waiting for old snapshots'::text
+            WHEN 9 THEN 'waiting for readers before marking dead'::text
+            WHEN 10 THEN 'waiting for readers before dropping'::text
             ELSE NULL::text
         END AS phase,
     s.param4 AS lockers_total,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 82e4062a215..805d2eb8485 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -503,6 +503,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1315,10 +1316,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
@@ -1330,6 +1333,24 @@ REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
 DROP TABLE concur_reindex_tab4;
 
+-- Check handling of auxiliary indexes
+CREATE TABLE aux_index_tab5 (c1 int);
+INSERT INTO aux_index_tab5 VALUES (1), (1), (2);
+-- This trick creates an invalid index and auxiliary index for it
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+\d aux_index_tab5
+-- Not allowed to reindex auxiliary index
+REINDEX INDEX aux_index_ind6_ccaux;
+-- Concurrently also
+REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex
+REINDEX TABLE aux_index_tab5;
+-- Should be skipped during concurrent reindex
+REINDEX TABLE CONCURRENTLY aux_index_tab5;
+DROP TABLE aux_index_tab5;
+
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
 -- definitions.
-- 
2.43.0



  [application/octet-stream] v35-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch (31.6K, 3-v35-0005-Track-and-drop-auxiliary-indexes-in-DROP-REINDEX.patch)
  download | inline diff:
From a3cb8e3e33c03904d455ac986cc0ee0be41ad0e4 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Tue, 31 Dec 2024 14:36:31 +0100
Subject: [PATCH v35 5/7] Track and drop auxiliary indexes in DROP/REINDEX

If a concurrent index operation fails, its auxiliary index may be left behind as an orphaned object (a junk auxiliary index).

This patch improves the handling of such auxiliary indexes:
- add an auxiliary_for_index_id parameter to makeIndexInfo(), stored in IndexInfo.ii_AuxiliaryForIndexId, so that index_create() can record the dependency of an auxiliary index on its main index
- automatically drop auxiliary indexes when the main index is dropped
- delete junk auxiliary indexes properly during REINDEX operations
---
 doc/src/sgml/ref/create_index.sgml         |  14 ++-
 doc/src/sgml/ref/reindex.sgml              |  10 +-
 src/backend/catalog/dependency.c           |   2 +-
 src/backend/catalog/index.c                |  78 ++++++++++++----
 src/backend/catalog/pg_depend.c            |  62 +++++++++++++
 src/backend/catalog/toasting.c             |   1 +
 src/backend/commands/indexcmds.c           |  37 +++++++-
 src/backend/commands/tablecmds.c           |  52 ++++++++++-
 src/backend/nodes/makefuncs.c              |   3 +-
 src/include/catalog/dependency.h           |   1 +
 src/include/nodes/execnodes.h              |   2 +
 src/include/nodes/makefuncs.h              |   2 +-
 src/test/regress/expected/create_index.out | 102 ++++++++++++++++++++-
 src/test/regress/sql/create_index.sql      |  55 ++++++++++-
 14 files changed, 382 insertions(+), 39 deletions(-)
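
Illustrative recovery session (a sketch only, not part of the patch). It follows the documentation changes below and assumes this patch series is applied. The index name "idx" and column "col" mirror the create_index.sgml example; the table name "tab" is a placeholder.

```sql
-- List invalid indexes left behind on the (placeholder) table "tab".
SELECT i.indexrelid::regclass AS index_name
FROM pg_index i
WHERE i.indrelid = 'tab'::regclass
  AND NOT i.indisvalid;

-- Dropping the main index is expected to remove "idx_ccaux" as well,
-- through the AUTO dependency recorded in index_create().
DROP INDEX idx;

-- Retry the build once the cause of the failure has been fixed.
CREATE INDEX CONCURRENTLY idx ON tab (col);
```

If only the "_ccaux" index is invalid, the updated documentation recommends dropping just that index.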

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 901c6cf22bc..b0407c840b3 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -668,10 +668,16 @@ Indexes:
     "idx_ccaux" stir (col) INVALID
 </programlisting>
 
-    The recommended recovery
-    method in such cases is to drop these indexes and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the main index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>
 
    <para>
diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 56c9a0fe1f3..297b8b5fde2 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -475,12 +475,16 @@ Indexes:
     If the index marked <literal>INVALID</literal> is suffixed
     <literal>_ccnew</literal> or <literal>_ccaux</literal>, then it corresponds to the transient or auxiliary
     index created during the concurrent operation, and the recommended
-    recovery method is to drop these indexes using <literal>DROP INDEX</literal>,
-    then attempt <command>REINDEX CONCURRENTLY</command> again.
+    recovery method is to drop the transient index using <literal>DROP INDEX</literal>,
+    then attempt <command>REINDEX CONCURRENTLY</command> again. The auxiliary index
+    (suffixed with <literal>_ccaux</literal>) will be automatically dropped
+    along with its main index.
     If the invalid index is instead suffixed <literal>_ccold</literal>,
     it corresponds to the original index which could not be dropped;
     the recommended recovery method is to just drop said index, since the
-    rebuild proper has been successful.
+    rebuild proper has been successful. If the only
+    invalid index is one suffixed <literal>_ccaux</literal>, the recommended
+    recovery method is just <literal>DROP INDEX</literal> for that index.
     A nonzero number may be appended to the suffix of the invalid index
     names to keep them unique, like <literal>_ccnew1</literal>,
     <literal>_ccold2</literal>, etc.
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index fdb8e67e1f5..c6941fb19d1 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -292,7 +292,7 @@ performDeletion(const ObjectAddress *object,
 	 * Acquire deletion lock on the target object.  (Ideally the caller has
 	 * done this already, but many places are sloppy about it.)
 	 */
-	AcquireDeletionLock(object, 0);
+	AcquireDeletionLock(object, flags);
 
 	/*
 	 * Construct a list of objects to delete (ie, the given object plus
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b1417ec05c6..9136dfc7c73 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -780,6 +780,8 @@ index_create(Relation heapRelation,
 		   ((flags & INDEX_CREATE_ADD_CONSTRAINT) != 0));
 	/* partitioned indexes must never be "built" by themselves */
 	Assert(!partitioned || (flags & INDEX_CREATE_SKIP_BUILD));
+	/* ii_AuxiliaryForIndexId and INDEX_CREATE_AUXILIARY must be set together or not at all */
+	Assert(OidIsValid(indexInfo->ii_AuxiliaryForIndexId) == auxiliary);
 
 	relkind = partitioned ? RELKIND_PARTITIONED_INDEX : RELKIND_INDEX;
 	is_exclusion = (indexInfo->ii_ExclusionOps != NULL);
@@ -1185,6 +1187,15 @@ index_create(Relation heapRelation,
 			recordDependencyOn(&myself, &referenced, DEPENDENCY_PARTITION_SEC);
 		}
 
+		/*
+		 * Record dependency on the main index in case of auxiliary index.
+		 */
+		if (OidIsValid(indexInfo->ii_AuxiliaryForIndexId))
+		{
+			ObjectAddressSet(referenced, RelationRelationId, indexInfo->ii_AuxiliaryForIndexId);
+			recordDependencyOn(&myself, &referenced, DEPENDENCY_AUTO);
+		}
+
 		/* placeholder for normal dependencies */
 		addrs = new_object_addresses();
 
@@ -1417,7 +1428,8 @@ index_create_copy(Relation heapRelation, uint16 flags,
 							concurrently,	/* concurrent */
 							indexRelation->rd_indam->amsummarizing,
 							oldInfo->ii_WithoutOverlaps,
-							false);
+							false,
+							InvalidOid);
 
 	/* fetch exclusion constraint info if any */
 	if (indexRelation->rd_index->indisexclusion)
@@ -1601,7 +1613,8 @@ index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
 							true,
 							false,	/* aux are not summarizing */
 							false,	/* aux are not without overlaps */
-							true	/* auxiliary */);
+							true	/* auxiliary */,
+							mainIndexId /* auxiliaryForIndexId */);
 
 	/*
 	 * Extract the list of column names and the column numbers for the new
@@ -2640,7 +2653,8 @@ BuildIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid /* auxiliary_for_index_id is set only during build */);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -2701,7 +2715,8 @@ BuildDummyIndexInfo(Relation index)
 					   false,
 					   index->rd_indam->amsummarizing,
 					   indexStruct->indisexclusion && indexStruct->indisunique,
-					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */);
+					   index->rd_rel->relam == STIR_AM_OID /* auxiliary iff STIR */,
+					   InvalidOid);
 
 	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
@@ -3783,8 +3798,11 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			indexForm->indisvalid = true;
 			break;
 		case INDEX_DROP_CLEAR_READY:
-			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
-			Assert(indexForm->indisready);
+			/*
+			 * Clear indisready during a CREATE INDEX CONCURRENTLY sequence.
+			 * indisready may already be false if the CIC failed before
+			 * index_concurrently_build had a chance to set it.
+			 */
 			Assert(!indexForm->indisvalid);
 			indexForm->indisready = false;
 			break;
@@ -3869,6 +3887,7 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 				heapRelation;
 	Oid			heapId;
 	Oid			save_userid;
+	Oid			junkAuxIndexId;
 	int			save_sec_context;
 	int			save_nestlevel;
 	IndexInfo  *indexInfo;
@@ -3925,6 +3944,19 @@ reindex_index(const ReindexStmt *stmt, Oid indexId,
 		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
 	}
 
+	/* If this index has a leftover auxiliary index, it needs to be dropped */
+	junkAuxIndexId = get_auxiliary_index(indexId);
+	if (OidIsValid(junkAuxIndexId))
+	{
+		ObjectAddress object;
+		object.classId = RelationRelationId;
+		object.objectId = junkAuxIndexId;
+		object.objectSubId = 0;
+		performDeletion(&object, DROP_RESTRICT,
+								 PERFORM_DELETION_INTERNAL |
+								 PERFORM_DELETION_QUIETLY);
+	}
+
 	/*
 	 * Open the target index relation and get an exclusive lock on it, to
 	 * ensure that no one else is touching this particular index.
@@ -4213,7 +4245,8 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 {
 	Relation	rel;
 	Oid			toast_relid;
-	List	   *indexIds;
+	List	   *indexIds,
+			   *auxIndexIds = NIL;
 	char		persistence;
 	bool		result = false;
 	ListCell   *indexId;
@@ -4302,13 +4335,30 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 	else
 		persistence = rel->rd_rel->relpersistence;
 
+	foreach(indexId, indexIds)
+	{
+		Oid			indexOid = lfirst_oid(indexId);
+		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* All STIR indexes are auxiliary indexes */
+		if (indexAm == STIR_AM_OID)
+		{
+			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
+				RemoveReindexPending(indexOid);
+			auxIndexIds = lappend_oid(auxIndexIds, indexOid);
+		}
+	}
+
 	/* Reindex all the indexes. */
 	i = 1;
 	foreach(indexId, indexIds)
 	{
 		Oid			indexOid = lfirst_oid(indexId);
 		Oid			indexNamespaceId = get_rel_namespace(indexOid);
-		Oid			indexAm = get_rel_relam(indexOid);
+
+		/* Auxiliary indexes are going to be dropped during main index rebuild */
+		if (list_member_oid(auxIndexIds, indexOid))
+			continue;
 
 		/*
 		 * Skip any invalid indexes on a TOAST table.  These can only be
@@ -4334,18 +4384,6 @@ reindex_relation(const ReindexStmt *stmt, Oid relid, int flags,
 			continue;
 		}
 
-		if (indexAm == STIR_AM_OID)
-		{
-			ereport(WARNING,
-					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("skipping reindex of auxiliary index \"%s.%s\"",
-							get_namespace_name(indexNamespaceId),
-							get_rel_name(indexOid))));
-			if (flags & REINDEX_REL_SUPPRESS_INDEX_USE)
-				RemoveReindexPending(indexOid);
-			continue;
-		}
-
 		reindex_index(stmt, indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 					  persistence, params);
 
diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index 07c2d41c189..deacd2f7c95 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -20,6 +20,7 @@
 #include "catalog/catalog.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
+#include "catalog/pg_am_d.h"
 #include "catalog/pg_constraint.h"
 #include "catalog/pg_depend.h"
 #include "catalog/pg_extension.h"
@@ -1108,6 +1109,67 @@ get_index_constraint(Oid indexId)
 	return constraintId;
 }
 
+/*
+ * get_auxiliary_index
+ *		Given the OID of an index, return the OID of its auxiliary
+ *		index, or InvalidOid if there is no auxiliary index.
+ */
+Oid
+get_auxiliary_index(Oid indexId)
+{
+	Oid			auxiliaryIndexOid = InvalidOid;
+	Relation	depRel;
+	ScanKeyData key[3];
+	SysScanDesc scan;
+	HeapTuple	tup;
+
+	/* Search the dependency table for the index */
+	depRel = table_open(DependRelationId, AccessShareLock);
+
+	ScanKeyInit(&key[0],
+				Anum_pg_depend_refclassid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(RelationRelationId));
+	ScanKeyInit(&key[1],
+				Anum_pg_depend_refobjid,
+				BTEqualStrategyNumber, F_OIDEQ,
+				ObjectIdGetDatum(indexId));
+	ScanKeyInit(&key[2],
+				Anum_pg_depend_refobjsubid,
+				BTEqualStrategyNumber, F_INT4EQ,
+				Int32GetDatum(0));
+
+	scan = systable_beginscan(depRel, DependReferenceIndexId, true,
+							  NULL, 3, key);
+
+	while (HeapTupleIsValid(tup = systable_getnext(scan)))
+	{
+		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
+
+		/*
+		 * Look for a STIR index that has an AUTO dependency on this index.
+		 * There is at most one such auxiliary, so stop at the first match.
+		 * Transitive auxiliaries (e.g. ccnew_ccaux from a failed REINDEX
+		 * CONCURRENTLY) are found by calling this with the ccnew OID, and
+		 * are also cleaned up automatically via cascading AUTO dependency
+		 * when the intermediate index is dropped.
+		 */
+		if (deprec->classid == RelationRelationId &&
+			(deprec->deptype == DEPENDENCY_AUTO) &&
+			get_rel_relkind(deprec->objid) == RELKIND_INDEX &&
+			get_rel_relam(deprec->objid) == STIR_AM_OID)
+		{
+			auxiliaryIndexOid = deprec->objid;
+			break;
+		}
+	}
+
+	systable_endscan(scan);
+	table_close(depRel, AccessShareLock);
+
+	return auxiliaryIndexOid;
+}
+
 /*
  * get_index_ref_constraints
  *		Given the OID of an index, return the OID of all foreign key
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index d7ea86b2805..f428dcdf10f 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -315,6 +315,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
 	indexInfo->ii_Auxiliary = false;
+	indexInfo->ii_AuxiliaryForIndexId = InvalidOid;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2d7b6b7eb8b..46c4ccc6789 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -247,7 +247,7 @@ CheckIndexCompatible(Oid oldId,
 	indexInfo = makeIndexInfo(numberOfAttributes, numberOfAttributes,
 							  accessMethodId, NIL, NIL, false, false,
 							  false, false, amsummarizing,
-							  isWithoutOverlaps, isauxiliary);
+							  isWithoutOverlaps, isauxiliary, InvalidOid);
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
 	opclassIds = palloc_array(Oid, numberOfAttributes);
@@ -947,7 +947,8 @@ DefineIndex(ParseState *pstate,
 							  concurrent,
 							  amissummarizing,
 							  stmt->iswithoutoverlaps,
-							  false);
+							  false,
+							  InvalidOid);
 
 	typeIds = palloc_array(Oid, numberOfAttributes);
 	collationIds = palloc_array(Oid, numberOfAttributes);
@@ -3711,6 +3712,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		Oid			indexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -4060,6 +4062,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
 		Oid			auxIndexId;
+		Oid			junkAuxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
@@ -4067,6 +4070,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		int			save_nestlevel;
 		Relation	newIndexRel;
 		Relation	auxIndexRel;
+		Relation	junkAuxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -4144,12 +4148,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 												   tablespaceid,
 												   auxConcurrentName);
 
+		/* Search for auxiliary index for reindexed index, to drop it */
+		junkAuxIndexId = get_auxiliary_index(idx->indexId);
+
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
 		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
+		if (OidIsValid(junkAuxIndexId))
+			junkAuxIndexRel = index_open(junkAuxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -4159,6 +4168,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
+		newidx->junkAuxIndexId = junkAuxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -4180,10 +4190,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		if (OidIsValid(junkAuxIndexId))
+		{
+			lockrelid = palloc_object(LockRelId);
+			*lockrelid = junkAuxIndexRel->rd_lockInfo.lockRelId;
+			relationLocks = lappend(relationLocks, lockrelid);
+		}
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		if (OidIsValid(junkAuxIndexId))
+			index_close(junkAuxIndexRel, NoLock);
 		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
@@ -4372,7 +4390,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	/*
 	 * At this moment all target indexes are marked as "ready-to-insert". So,
-	 * we are free to start process of dropping auxiliary indexes.
+	 * we are free to start the process of dropping auxiliary indexes,
+	 * including junk indexes detected earlier.
 	 */
 	foreach(lc, newIndexIds)
 	{
@@ -4395,6 +4414,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		PushActiveSnapshot(GetTransactionSnapshot());
 		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		/* Ensure the junk index is marked as non-ready */
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_set_state_flags(newidx->junkAuxIndexId, INDEX_DROP_CLEAR_READY);
 		PopActiveSnapshot();
 
 		CommitTransactionCommand();
@@ -4614,6 +4636,8 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PushActiveSnapshot(GetTransactionSnapshot());
 
 		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+		if (OidIsValid(newidx->junkAuxIndexId))
+			index_concurrently_set_dead(newidx->tableId, newidx->junkAuxIndexId);
 
 		PopActiveSnapshot();
 	}
@@ -4665,6 +4689,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			object.objectSubId = 0;
 
 			add_exact_object_address(&object, objects);
+
+			if (OidIsValid(idx->junkAuxIndexId))
+			{
+				object.objectId = idx->junkAuxIndexId;
+				object.objectSubId = 0;
+				add_exact_object_address(&object, objects);
+			}
 		}
 
 		/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eec09ba1ded..eaae8f7ca5f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1567,6 +1567,8 @@ RemoveRelations(DropStmt *drop)
 	ListCell   *cell;
 	int			flags = 0;
 	LOCKMODE	lockmode = AccessExclusiveLock;
+	MemoryContext private_context,
+				  oldcontext;
 
 	/* DROP CONCURRENTLY uses a weaker lock, and has some restrictions */
 	if (drop->concurrent)
@@ -1631,9 +1633,20 @@ RemoveRelations(DropStmt *drop)
 			relkind = 0;		/* keep compiler quiet */
 			break;
 	}
+	/*
+	 * Create a memory context that will survive forced transaction commits we
+	 * may need to do below (in case of concurrent index drop).
+	 * Since it is a child of PortalContext, it will go away eventually even if
+	 * we suffer an error; there's no need for special abort cleanup logic.
+	 */
+	private_context = AllocSetContextCreate(PortalContext,
+											"RemoveRelations",
+											ALLOCSET_SMALL_SIZES);
 
+	oldcontext = MemoryContextSwitchTo(private_context);
 	/* Lock and validate each relation; build a list of object addresses */
 	objects = new_object_addresses();
+	MemoryContextSwitchTo(oldcontext);
 
 	foreach(cell, drop->objects)
 	{
@@ -1685,6 +1698,38 @@ RemoveRelations(DropStmt *drop)
 			flags |= PERFORM_DELETION_CONCURRENTLY;
 		}
 
+		/*
+		 * A concurrent index drop has to run as the first transaction.  If
+		 * the index has a junk auxiliary index, we want to drop that too,
+		 * also concurrently.  So perform a silent internal deletion of the
+		 * auxiliary index, then restore the transaction state.  This is fine
+		 * inside the loop because drop->objects has only a single element.
+		 */
+		if ((flags & PERFORM_DELETION_CONCURRENTLY) != 0 &&
+			state.actual_relkind == RELKIND_INDEX)
+		{
+			Oid junkAuxIndexOid = get_auxiliary_index(relOid);
+			if (OidIsValid(junkAuxIndexOid))
+			{
+				ObjectAddress object;
+				object.classId = RelationRelationId;
+				object.objectId = junkAuxIndexOid;
+				object.objectSubId = 0;
+				performDeletion(&object, DROP_RESTRICT,
+										 PERFORM_DELETION_CONCURRENTLY |
+										 PERFORM_DELETION_INTERNAL |
+										 PERFORM_DELETION_QUIETLY);
+				CommitTransactionCommand();
+				MemoryContextDelete(private_context);
+
+				/* And start again - now without auxiliary index. */
+				StartTransactionCommand();
+				PushActiveSnapshot(GetTransactionSnapshot());
+				RemoveRelations(drop);
+				return;
+			}
+		}
+
 		/*
 		 * Concurrent index drop cannot be used with partitioned indexes,
 		 * either.
@@ -1713,12 +1758,17 @@ RemoveRelations(DropStmt *drop)
 		obj.objectId = relOid;
 		obj.objectSubId = 0;
 
+		oldcontext = MemoryContextSwitchTo(private_context);
 		add_exact_object_address(&obj, objects);
+		MemoryContextSwitchTo(oldcontext);
 	}
 
+	/* Deletion may involve multiple commits, so switch to the private memory context */
+	oldcontext = MemoryContextSwitchTo(private_context);
 	performMultipleDeletions(objects, drop->behavior, flags);
+	MemoryContextSwitchTo(oldcontext);
 
-	free_object_addresses(objects);
+	MemoryContextDelete(private_context);
 }
 
 /*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 84f7cf9824e..c54748ff644 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -834,7 +834,7 @@ IndexInfo *
 makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 			  List *predicates, bool unique, bool nulls_not_distinct,
 			  bool isready, bool concurrent, bool summarizing,
-			  bool withoutoverlaps, bool auxiliary)
+			  bool withoutoverlaps, bool auxiliary, Oid auxiliary_for_index_id)
 {
 	IndexInfo  *n = makeNode(IndexInfo);
 
@@ -851,6 +851,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	n->ii_Summarizing = summarizing;
 	n->ii_WithoutOverlaps = withoutoverlaps;
 	n->ii_Auxiliary = auxiliary;
+	n->ii_AuxiliaryForIndexId = auxiliary_for_index_id;
 
 	/* summarizing indexes cannot contain non-key attributes */
 	Assert(!summarizing || (numkeyattrs == numattrs));
diff --git a/src/include/catalog/dependency.h b/src/include/catalog/dependency.h
index 2f3c1eae3c7..6ae210c584e 100644
--- a/src/include/catalog/dependency.h
+++ b/src/include/catalog/dependency.h
@@ -193,6 +193,7 @@ extern List *getOwnedSequences(Oid relid);
 extern Oid	getIdentitySequence(Relation rel, AttrNumber attnum, bool missing_ok);
 
 extern Oid	get_index_constraint(Oid indexId);
+extern Oid	get_auxiliary_index(Oid indexId);
 
 extern List *get_index_ref_constraints(Oid indexId);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3eaeed3c141..af58d4cf4b5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -232,6 +232,8 @@ typedef struct IndexInfo
 	int			ii_ParallelWorkers;
 	/* is auxiliary for concurrent index build? */
 	bool		ii_Auxiliary;
+	/* if creating an auxiliary index, the OID of the main index; otherwise InvalidOid. */
+	Oid			ii_AuxiliaryForIndexId;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index cd7f1eb0592..3a704781c8b 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -100,7 +100,7 @@ extern IndexInfo *makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid,
 								bool unique, bool nulls_not_distinct,
 								bool isready, bool concurrent,
 								bool summarizing, bool withoutoverlaps,
-								bool auxiliary);
+								bool auxiliary, Oid auxiliary_for_index_id);
 
 extern Node *makeStringConst(char *str, int location);
 extern DefElem *makeDefElem(char *name, Node *arg, int location);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 937c3b48a0f..2d6abb15a89 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3280,12 +3280,108 @@ REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 ERROR:  reindex of auxiliary index "aux_index_ind6_ccaux" not supported
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
-WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+WARNING:  skipping reindex of invalid index "public.aux_index_ind6"
+HINT:  Use DROP INDEX or REINDEX INDEX.
 WARNING:  skipping reindex of auxiliary index "public.aux_index_ind6_ccaux"
+NOTICE:  table "aux_index_tab5" has no indexes that can be reindexed concurrently
+-- Make sure it is still exists
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1) INVALID
+    "aux_index_ind6_ccaux" stir (c1) INVALID
+
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+Indexes:
+    "aux_index_ind6" UNIQUE, btree (c1)
+
+DROP INDEX aux_index_ind6;
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
+DROP INDEX aux_index_ind6;
+ERROR:  index "aux_index_ind6" does not exist
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+ERROR:  could not create unique index "aux_index_ind6"
+DETAIL:  Key (c1)=(1) is duplicated.
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+           Table "public.aux_index_tab5"
+ Column |  Type   | Collation | Nullable | Default 
+--------+---------+-----------+----------+---------
+ c1     | integer |           |          | 
+
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index 805d2eb8485..fd96d80abbc 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1345,10 +1345,61 @@ REINDEX INDEX aux_index_ind6_ccaux;
 REINDEX INDEX CONCURRENTLY aux_index_ind6_ccaux;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
 -- Should be skipped during reindex
-REINDEX TABLE aux_index_tab5;
--- Should be skipped during concurrent reindex
 REINDEX TABLE CONCURRENTLY aux_index_tab5;
+-- Make sure it is still exists
+\d aux_index_tab5
+-- Should be skipped during reindex and dropped
+REINDEX TABLE aux_index_tab5;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Should be skipped during reindex and dropped
+REINDEX INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure aux index is dropped
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- This makes the previous failure go away, so the index can become valid.
+DELETE FROM aux_index_tab5 WHERE c1 = 1;
+-- Drop main index CONCURRENTLY
+DROP INDEX CONCURRENTLY aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+DROP INDEX aux_index_ind6;
+
+-- Insert duplicates again
+INSERT INTO aux_index_tab5 VALUES (1), (1);
+-- Create invalid index again
+CREATE UNIQUE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+-- Drop main index
+DROP INDEX aux_index_ind6;
+-- Make sure auxiliary index dropped too
+\d aux_index_tab5
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0



  [application/octet-stream] v35-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch (21.0K, 4-v35-0003-Add-Datum-storage-support-to-tuplestore-Extend-t.patch)
  download | inline diff:
From c705ccc819f3d4ef26407f0eb7fe9e3da56f2304 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 12 Jan 2026 00:57:56 +0300
Subject: [PATCH v35 3/7] Add Datum storage support to tuplestore Extend
 tuplestore to store individual Datum values

This enables the use of tuplestore for non-tuple data (TIDs) in the next commit.
---
 src/backend/utils/sort/tuplestore.c | 367 +++++++++++++++++++++++-----
 src/include/utils/tuplestore.h      |  33 +--
 2 files changed, 327 insertions(+), 73 deletions(-)

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index f9e2d95186a..2a9b25bd238 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -1,16 +1,19 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.c
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
+ *
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
  * Also, it is possible to support multiple independent read pointers.
  *
  * A temporary file is used to handle the data if it exceeds the
@@ -61,6 +64,8 @@
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
+#include "utils/datum.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/resowner.h"
 #include "utils/tuplestore.h"
@@ -116,16 +121,15 @@ struct Tuplestorestate
 	BufFile    *myfile;			/* underlying file, or NULL if none */
 	MemoryContext context;		/* memory context for holding tuples */
 	ResourceOwner resowner;		/* resowner for holding temp files */
+	Oid			datumType;		/* InvalidOid, or type OID of the Datums to be stored */
+	int16		datumTypeLen;	/* typelen of that Datum */
+	bool		datumTypeByVal; /* by-value of that Datum */
 
 	/*
 	 * These function pointers decouple the routines that must know what kind
 	 * of tuple we are handling from the routines that don't need to know it.
 	 * They are set up by the tuplestore_begin_xxx routines.
 	 *
-	 * (Although tuplestore.c currently only supports heap tuples, I've copied
-	 * this part of tuplesort.c so that extension to other kinds of objects
-	 * will be easy if it's ever needed.)
-	 *
 	 * Function to copy a supplied input tuple into palloc'd space. (NB: we
 	 * assume that a single pfree() is enough to release the tuple later, so
 	 * the representation must be "flat" in one palloc chunk.) state->availMem
@@ -150,6 +154,12 @@ struct Tuplestorestate
 	 */
 	void	   *(*readtup) (Tuplestorestate *state, unsigned int len);
 
+	/*
+	 * Function to get length of tuple from tape. Used to provide 'len' argument
+	 * for readtup (see above).
+	 */
+	unsigned int (*lentup) (Tuplestorestate *state, bool eofOK);
+
 	/*
 	 * This array holds pointers to tuples in memory if we are in state INMEM.
 	 * In states WRITEFILE and READFILE it's not used.
@@ -186,6 +196,7 @@ struct Tuplestorestate
 #define COPYTUP(state,tup)	((*(state)->copytup) (state, tup))
 #define WRITETUP(state,tup) ((*(state)->writetup) (state, tup))
 #define READTUP(state,len)	((*(state)->readtup) (state, len))
+#define LENTUP(state,eofOK)	((*(state)->lentup) (state, eofOK))
 #define LACKMEM(state)		((state)->availMem < 0)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
@@ -194,9 +205,9 @@ struct Tuplestorestate
  *
  * NOTES about on-tape representation of tuples:
  *
- * We require the first "unsigned int" of a stored tuple to be the total size
- * on-tape of the tuple, including itself (so it is never zero).
- * The remainder of the stored tuple
+ * For tuples, we require the first "unsigned int" of a stored tuple
+ * to be the total size on-tape of the tuple, including itself
+ * (so it is never zero). The remainder of the stored tuple
  * may or may not match the in-memory representation of the tuple ---
  * any conversion needed is the job of the writetup and readtup routines.
  *
@@ -207,10 +218,13 @@ struct Tuplestorestate
  * state->backward is not set, the write/read routines may omit the extra
  * length word.
  *
- * writetup is expected to write both length words as well as the tuple
+ * For by-value Datums, both length words are replaced by a single marker byte.
+ *
+ * writetup is expected to write both length words and the tuple
  * data.  When readtup is called, the tape is positioned just after the
- * front length word; readtup must read the tuple data and advance past
- * the back length word (if present).
+ * front length word (or the marker byte, for a by-value Datum);
+ * readtup must read the tuple data and advance past the back length word
+ * (if present).
  *
  * The write/read routines can make use of the tuple description data
  * stored in the Tuplestorestate record, if needed. They are also expected
@@ -242,11 +256,16 @@ static Tuplestorestate *tuplestore_begin_common(int eflags,
 static void tuplestore_puttuple_common(Tuplestorestate *state, void *tuple);
 static void dumptuples(Tuplestorestate *state);
 static void tuplestore_updatemax(Tuplestorestate *state);
-static unsigned int getlen(Tuplestorestate *state, bool eofOK);
+
+static unsigned int lentup_heap(Tuplestorestate *state, bool eofOK);
 static void *copytup_heap(Tuplestorestate *state, void *tup);
 static void writetup_heap(Tuplestorestate *state, void *tup);
 static void *readtup_heap(Tuplestorestate *state, unsigned int len);
 
+static unsigned int lentup_datum(Tuplestorestate *state, bool eofOK);
+static void *copytup_datum(Tuplestorestate *state, void *datum);
+static void writetup_datum(Tuplestorestate *state, void *datum);
+static void *readtup_datum(Tuplestorestate *state, unsigned int len);
 
 /*
  *		tuplestore_begin_xxx
@@ -269,6 +288,12 @@ tuplestore_begin_common(int eflags, bool interXact, int maxKBytes)
 	state->allowedMem = maxKBytes * (int64) 1024;
 	state->availMem = state->allowedMem;
 	state->myfile = NULL;
+	/*
+	 * Set Datum related data to invalid by default.
+	 */
+	state->datumType = InvalidOid;
+	state->datumTypeLen = 0;
+	state->datumTypeByVal = false;
 
 	/*
 	 * The palloc/pfree pattern for tuple memory is in a FIFO pattern.  A
@@ -346,6 +371,37 @@ tuplestore_begin_heap(bool randomAccess, bool interXact, int maxKBytes)
 	state->copytup = copytup_heap;
 	state->writetup = writetup_heap;
 	state->readtup = readtup_heap;
+	state->lentup = lentup_heap;
+
+	return state;
+}
+
+/*
+ * Same as tuplestore_begin_heap, but creates a store for Datum values.
+ */
+Tuplestorestate *
+tuplestore_begin_datum(Oid datumType, bool randomAccess, bool interXact, int maxKBytes)
+{
+	Tuplestorestate *state;
+	int			eflags;
+
+	/*
+	 * This interpretation of the meaning of randomAccess is compatible with
+	 * the pre-8.3 behavior of tuplestores.
+	 */
+	eflags = randomAccess ?
+		(EXEC_FLAG_BACKWARD | EXEC_FLAG_REWIND) :
+		(EXEC_FLAG_REWIND);
+
+	state = tuplestore_begin_common(eflags, interXact, maxKBytes);
+	state->datumType = datumType;
+	get_typlenbyval(state->datumType, &state->datumTypeLen, &state->datumTypeByVal);
+	Assert(!(state->datumTypeByVal && randomAccess));
+
+	state->copytup = copytup_datum;
+	state->writetup = writetup_datum;
+	state->readtup = readtup_datum;
+	state->lentup = lentup_datum;
 
 	return state;
 }
@@ -444,16 +500,19 @@ tuplestore_clear(Tuplestorestate *state)
 	{
 		int64		availMem = state->availMem;
 
-		/*
-		 * Below, we reset the memory context for storing tuples.  To save
-		 * from having to always call GetMemoryChunkSpace() on all stored
-		 * tuples, we adjust the availMem to forget all the tuples and just
-		 * recall USEMEM for the space used by the memtuples array.  Here we
-		 * just Assert that's correct and the memory tracking hasn't gone
-		 * wrong anywhere.
-		 */
-		for (i = state->memtupdeleted; i < state->memtupcount; i++)
-			availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			/*
+			 * Below, we reset the memory context for storing tuples.  To save
+			 * from having to always call GetMemoryChunkSpace() on all stored
+			 * tuples, we adjust the availMem to forget all the tuples and just
+			 * recall USEMEM for the space used by the memtuples array.  Here we
+			 * just Assert that's correct and the memory tracking hasn't gone
+			 * wrong anywhere.
+			 */
+			for (i = state->memtupdeleted; i < state->memtupcount; i++)
+				availMem += GetMemoryChunkSpace(state->memtuples[i]);
+		}
 
 		availMem += GetMemoryChunkSpace(state->memtuples);
 
@@ -777,6 +836,25 @@ tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple)
 	MemoryContextSwitchTo(oldcxt);
 }
 
+/*
+ * Like tuplestore_puttupleslot, but for a single Datum.
+ */
+void
+tuplestore_putdatum(Tuplestorestate *state, Datum datum)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(state->context);
+
+	/*
+	 * Copy the Datum.  (Must do this even in WRITEFILE case.  Note that
+	 * COPYTUP includes USEMEM, so we needn't do that here.)
+	 */
+	datum = PointerGetDatum(COPYTUP(state, DatumGetPointer(datum)));
+
+	tuplestore_puttuple_common(state, DatumGetPointer(datum));
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
 /*
  * Similar to tuplestore_puttuple(), but work from values + nulls arrays.
  * This avoids an extra tuple-construction operation.
@@ -1028,10 +1106,10 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 			pg_fallthrough;
 
 		case TSS_READFILE:
-			*should_free = true;
+			*should_free = !state->datumTypeByVal;
 			if (forward)
 			{
-				if ((tuplen = getlen(state, true)) != 0)
+				if ((tuplen = LENTUP(state, true)) != 0)
 				{
 					tup = READTUP(state, tuplen);
 					return tup;
@@ -1043,6 +1121,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				}
 			}
 
+			Assert(!state->datumTypeByVal);
 			/*
 			 * Backward.
 			 *
@@ -1060,7 +1139,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 				Assert(!state->truncated);
 				return NULL;
 			}
-			tuplen = getlen(state, false);
+			tuplen = LENTUP(state, false);
 
 			if (readptr->eof_reached)
 			{
@@ -1091,7 +1170,7 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					Assert(!state->truncated);
 					return NULL;
 				}
-				tuplen = getlen(state, false);
+				tuplen = LENTUP(state, false);
 			}
 
 			/*
@@ -1153,6 +1232,41 @@ tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 	}
 }
 
+bool
+tuplestore_getdatum(Tuplestorestate *state, bool forward,
+					bool *should_free, Datum *result)
+{
+	Datum datum;
+	*should_free = false;
+
+	datum = (Datum) tuplestore_gettuple(state, forward, should_free);
+
+	/* For by-value datums, zero is a valid value, so check eof_reached instead. */
+	if (state->datumTypeByVal)
+	{
+		/* Return false only on EOF */
+		if (state->readptrs[state->activeptr].eof_reached)
+		{
+			*result = PointerGetDatum(NULL);
+			return false;
+		}
+
+		*result = datum;
+		return true;
+	}
+
+	if (datum)
+	{
+		*result = datum;
+		return true;
+	}
+	else
+	{
+		*result = PointerGetDatum(NULL);
+		return false;
+	}
+}
+
 /*
  * tuplestore_gettupleslot_force - exported function to fetch a tuple
  *
@@ -1205,10 +1319,20 @@ tuplestore_advance(Tuplestorestate *state, bool forward)
 			pfree(tuple);
 		return true;
 	}
-	else
+
+	/*
+	 * A NULL return normally means end-of-data, but for by-value datum
+	 * stores a valid zero-valued datum (e.g., false, 0) is indistinguishable
+	 * from NULL via pointer check.  Use eof_reached to distinguish.
+	 */
+	if (state->datumTypeByVal)
 	{
-		return false;
+		TSReadPointer *readptr = &state->readptrs[state->activeptr];
+
+		return !readptr->eof_reached;
 	}
+
+	return false;
 }
 
 /*
@@ -1271,7 +1395,13 @@ tuplestore_skiptuples(Tuplestorestate *state, int64 ntuples, bool forward)
 				tuple = tuplestore_gettuple(state, forward, &should_free);
 
 				if (tuple == NULL)
-					return false;
+				{
+					/* See tuplestore_advance for why pointer check is insufficient */
+					if (!state->datumTypeByVal ||
+						state->readptrs[state->activeptr].eof_reached)
+						return false;
+					continue;
+				}
 				if (should_free)
 					pfree(tuple);
 				CHECK_FOR_INTERRUPTS();
@@ -1505,8 +1635,11 @@ tuplestore_trim(Tuplestorestate *state)
 	/* Release no-longer-needed tuples */
 	for (i = state->memtupdeleted; i < nremove; i++)
 	{
-		FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
-		pfree(state->memtuples[i]);
+		if (!state->datumTypeByVal)
+		{
+			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
+			pfree(state->memtuples[i]);
+		}
 		state->memtuples[i] = NULL;
 		/* As in dumptuples(), increment memtupdeleted synchronously */
 		state->memtupdeleted++;
@@ -1603,25 +1736,6 @@ tuplestore_in_memory(Tuplestorestate *state)
 	return (state->status == TSS_INMEM);
 }
 
-
-/*
- * Tape interface routines
- */
-
-static unsigned int
-getlen(Tuplestorestate *state, bool eofOK)
-{
-	unsigned int len;
-	size_t		nbytes;
-
-	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
-	if (nbytes == 0)
-		return 0;
-	else
-		return len;
-}
-
-
 /*
  * Routines specialized for HeapTuple case
  *
@@ -1632,6 +1746,19 @@ getlen(Tuplestorestate *state, bool eofOK)
  * to write that separately.
  */
 
+static unsigned int
+lentup_heap(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	else
+		return len;
+}
+
 static void *
 copytup_heap(Tuplestorestate *state, void *tup)
 {
@@ -1678,3 +1805,127 @@ readtup_heap(Tuplestorestate *state, unsigned int len)
 		BufFileReadExact(state->myfile, &tuplen, sizeof(tuplen));
 	return tuple;
 }
+
+/*
+ * Routines specialized for Datum case.
+ *
+ * Handles both fixed and variable-length Datums efficiently:
+ * - By-reference types (fixed or variable length) get a length prefix (and suffix for backward scan)
+ * - By-value types are handled inline without extra copying, plus a single marker byte
+ *   XXX: consider refactoring to avoid the marker; currently needed for correct rewind logic
+ */
+
+static unsigned int
+lentup_datum(Tuplestorestate *state, bool eofOK)
+{
+	unsigned int len;
+	size_t		nbytes;
+
+	Assert(state->datumType != InvalidOid);
+
+	if (state->datumTypeByVal)
+	{
+		uint8	junk;
+		nbytes = BufFileReadMaybeEOF(state->myfile, &junk, sizeof(uint8), eofOK);
+		if (nbytes == 0)
+			return 0;
+		Assert(junk == (uint8) state->datumTypeLen);
+		return state->datumTypeLen;
+	}
+
+	nbytes = BufFileReadMaybeEOF(state->myfile, &len, sizeof(len), eofOK);
+	if (nbytes == 0)
+		return 0;
+	return len;
+}
+
+static void *
+copytup_datum(Tuplestorestate *state, void *datum)
+{
+	Datum d;
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+		return DatumGetPointer(PointerGetDatum(datum));
+
+	if (datum == NULL)
+		return NULL;
+
+	d = datumCopy(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+	USEMEM(state, GetMemoryChunkSpace(DatumGetPointer(d)));
+	return DatumGetPointer(d);
+}
+
+static void
+writetup_datum(Tuplestorestate *state, void *datum)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		uint8 junk = state->datumTypeLen; /* overflow is ok */
+		Datum v;
+		Assert(state->datumTypeLen > 0);
+
+		/* marker byte, used only to detect end of data for the rewind logic */
+		BufFileWrite(state->myfile, &junk, sizeof(junk));
+		store_att_byval(&v, PointerGetDatum(datum), state->datumTypeLen);
+		BufFileWrite(state->myfile, &v, state->datumTypeLen);
+		Assert(!state->backward);
+	}
+	else
+	{
+		unsigned int size;
+		unsigned int tuplen;
+
+		if (state->datumTypeLen < 0)
+			size = datumGetSize(PointerGetDatum(datum), state->datumTypeByVal, state->datumTypeLen);
+		else
+			size = state->datumTypeLen;
+
+		/*
+		 * Include sizeof(unsigned int) in the stored length, matching the
+		 * convention used by writetup_heap.  The backward-scan seek
+		 * arithmetic in tuplestore_gettuple assumes this.
+		 */
+		tuplen = size + sizeof(unsigned int);
+		BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		BufFileWrite(state->myfile, datum, size);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileWrite(state->myfile, &tuplen, sizeof(tuplen));
+
+		FREEMEM(state, GetMemoryChunkSpace(datum));
+		pfree(datum);
+	}
+}
+
+static void *
+readtup_datum(Tuplestorestate *state, unsigned int len)
+{
+	Assert(state->datumType != InvalidOid);
+	if (state->datumTypeByVal)
+	{
+		Datum datum = 0;
+
+		Assert(state->datumTypeLen > 0);
+		Assert(len == state->datumTypeLen);
+		BufFileReadExact(state->myfile, &datum, state->datumTypeLen);
+
+		Assert(!state->backward);
+		return DatumGetPointer(fetch_att(&datum, true, state->datumTypeLen));
+	}
+	else
+	{
+		unsigned int datalen = len - sizeof(unsigned int);
+		void *data = palloc(datalen);
+
+		BufFileReadExact(state->myfile, data, datalen);
+
+		/* need trailing length word? */
+		if (state->backward)
+			BufFileReadExact(state->myfile, &len, sizeof(len));
+
+		return data;
+	}
+}
diff --git a/src/include/utils/tuplestore.h b/src/include/utils/tuplestore.h
index f638b96e156..e16d9a3d352 100644
--- a/src/include/utils/tuplestore.h
+++ b/src/include/utils/tuplestore.h
@@ -1,17 +1,18 @@
 /*-------------------------------------------------------------------------
  *
  * tuplestore.h
- *	  Generalized routines for temporary tuple storage.
+ *	  Generalized routines for temporary storage of tuples and Datums.
  *
- * This module handles temporary storage of tuples for purposes such
- * as Materialize nodes, hashjoin batch files, etc.  It is essentially
- * a dumbed-down version of tuplesort.c; it does no sorting of tuples
- * but can only store and regurgitate a sequence of tuples.  However,
- * because no sort is required, it is allowed to start reading the sequence
- * before it has all been written.  This is particularly useful for cursors,
- * because it allows random access within the already-scanned portion of
- * a query without having to process the underlying scan to completion.
- * Also, it is possible to support multiple independent read pointers.
+ * This module handles temporary storage of either tuples or single
+ * Datum values for purposes such as Materialize nodes, hashjoin batch
+ * files, etc. It is essentially a dumbed-down version of tuplesort.c;
+ * it does no sorting of tuples but can only store and regurgitate a sequence
+ * of tuples.  However, because no sort is required, it is allowed to start
+ * reading the sequence before it has all been written.
+ *
+ * This is particularly useful for cursors, because it allows random access
+ * within the already-scanned portion of a query without having to process
+ * the underlying scan to completion.
  *
  * A temporary file is used to handle the data if it exceeds the
  * space limit specified by the caller.
@@ -39,14 +40,13 @@
  */
 typedef struct Tuplestorestate Tuplestorestate;
 
-/*
- * Currently we only need to store MinimalTuples, but it would be easy
- * to support the same behavior for IndexTuples and/or bare Datums.
- */
-
 extern Tuplestorestate *tuplestore_begin_heap(bool randomAccess,
 											  bool interXact,
 											  int maxKBytes);
+extern Tuplestorestate *tuplestore_begin_datum(Oid datumType,
+											   bool randomAccess,
+											   bool interXact,
+											   int maxKBytes);
 
 extern void tuplestore_set_eflags(Tuplestorestate *state, int eflags);
 
@@ -55,6 +55,7 @@ extern void tuplestore_puttupleslot(Tuplestorestate *state,
 extern void tuplestore_puttuple(Tuplestorestate *state, HeapTuple tuple);
 extern void tuplestore_putvalues(Tuplestorestate *state, TupleDesc tdesc,
 								 const Datum *values, const bool *isnull);
+extern void tuplestore_putdatum(Tuplestorestate *state, Datum datum);
 
 extern int	tuplestore_alloc_read_pointer(Tuplestorestate *state, int eflags);
 
@@ -72,6 +73,8 @@ extern bool tuplestore_in_memory(Tuplestorestate *state);
 
 extern bool tuplestore_gettupleslot(Tuplestorestate *state, bool forward,
 									bool copy, TupleTableSlot *slot);
+extern bool tuplestore_getdatum(Tuplestorestate *state, bool forward,
+								bool *should_free, Datum *result);
 
 extern bool tuplestore_gettupleslot_force(Tuplestorestate *state, bool forward,
 										  bool copy, TupleTableSlot *slot);
-- 
2.43.0
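

The by-value on-tape layout described in the patch above (one marker byte followed by the raw value, with end-of-data reported separately from the value itself) can be illustrated with a small standalone sketch. This is not code from the patch: `ToyTape`, `toy_put` and `toy_get` are hypothetical names, a fixed in-memory buffer stands in for the BufFile, and `int64_t` stands in for a by-value Datum. It only aims to show why a stored zero has to be told apart from end-of-data by something other than the returned value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory "tape"; stands in for the BufFile used by tuplestore. */
typedef struct
{
	uint8_t		buf[256];
	size_t		len;			/* bytes written so far */
	size_t		pos;			/* current read position */
	bool		eof_reached;	/* set once a read hits end of data */
} ToyTape;

/* Write one item: a marker byte holding the value length, then the value. */
static bool
toy_put(ToyTape *tape, int64_t value)
{
	uint8_t		marker = (uint8_t) sizeof(value);

	if (tape->len + 1 + sizeof(value) > sizeof(tape->buf))
		return false;			/* toy buffer is full */

	tape->buf[tape->len++] = marker;
	memcpy(tape->buf + tape->len, &value, sizeof(value));
	tape->len += sizeof(value);
	return true;
}

/*
 * Read the next item.  Returns false only at end of data; a stored zero
 * comes back as an ordinary value with a true result.
 */
static bool
toy_get(ToyTape *tape, int64_t *value)
{
	if (tape->pos >= tape->len)
	{
		/* No marker byte left: this is the only end-of-data signal. */
		tape->eof_reached = true;
		*value = 0;
		return false;
	}

	/* The marker byte plays the role of the length word. */
	assert(tape->buf[tape->pos] == sizeof(*value));
	tape->pos++;

	memcpy(value, tape->buf + tape->pos, sizeof(*value));
	tape->pos += sizeof(*value);
	return true;
}
```

In this sketch a caller loops on `toy_get()` until it returns false. Treating a returned zero as "no more data" would stop early, which is the same pitfall the patch avoids by consulting `eof_reached` for by-value stores.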



  [application/octet-stream] v35-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch (36.6K, 5-v35-0002-Add-STIR-access-method-and-flags-related-to-auxi.patch)
  download | inline diff:
From e752c9ff26e718e209dfb928bfd379c5302a1f77 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sun, 11 Jan 2026 19:27:52 +0300
Subject: [PATCH v35 2/7] Add STIR access method and flags related to auxiliary
 indexes

This patch provides infrastructure for subsequent enhancements to concurrent index builds:
- ii_Auxiliary in IndexInfo: indicates that an index is an auxiliary index used during a concurrent index build
- validate_index in IndexVacuumInfo: set if index_bulk_delete is called during the validation phase of a concurrent index build
- STIR (Short-Term Index Replacement) access method: intended solely for short-lived, auxiliary usage

STIR is designed as an ephemeral helper during concurrent index builds, temporarily storing TIDs without providing the full features of a typical access method. As such, it raises warnings or errors when accessed outside its specialized usage path.

It will be used in the following commits.
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   1 +
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 567 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/catalog/toasting.c           |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 110 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   7 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 24 files changed, 765 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 6a7f8cb4a7c..5b5984e3aa2 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -285,6 +285,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index e88d72ea039..ebbcfa90715 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -19,6 +19,7 @@ SUBDIRS	    = \
 	nbtree \
 	rmgrdesc \
 	spgist \
+	stir \
 	sequence \
 	table \
 	tablesample \
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 39395aed0d5..a6ac89360fc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3024,6 +3024,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -3075,6 +3076,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 5fd18de74f9..7219c65f365 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..8785dab37bd
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..4b7ad15346c
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..932590d9ccb
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,567 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete called during index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/amvalidate.h"
+#include "access/htup_details.h"
+#include "access/stir.h"
+#include "access/tableam.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "commands/vacuum.h"
+#include "miscadmin.h"
+#include "storage/bufmgr.h"
+#include "utils/catcache.h"
+#include "utils/fmgrprotos.h"
+#include "utils/index_selfuncs.h"
+#include "utils/memutils.h"
+#include "utils/regproc.h"
+#include "utils/syscache.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions = VACUUM_OPTION_NO_PARALLEL;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation could be skipped;
+ * we perform it anyway for consistency with other access methods.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators of the opfamily (STIR has no support functions) */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+					        opfamilyname,
+					        format_operator(oprform->amopopr),
+					        oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+		                          oprform->amoplefttype,
+		                          oprform->amoprighttype))
+		{
+			ereport(INFO,
+			        (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+				        errmsg("stir opfamily %s contains operator %s with wrong signature",
+					        opfamilyname,
+					        format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+/*
+ * Initialize meta-page of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magicNumber = STIR_MAGIC_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower = ((char *) metadata + sizeof(StirMetaPageData)) - (char *) metaPage;
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	metaPage = BufferGetPage(metaBuffer);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+
+	MarkBufferDirty(metaBuffer);
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if the tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	char *ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does the new tuple fit on the page? */
+	if (StirPageGetFreeSpace(page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy a new tuple to the end of the page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy(itup, tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (char *) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple itup;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	BlockNumber blkNo;
+
+	itup.heapPtr = *ht_ctid;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			page = BufferGetPage(buffer);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to the existing page */
+			if (StirPageAddItem(page, &itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				MarkBufferDirty(buffer);
+
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				return false;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add a new page - get exclusive lock on meta-page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+
+		/* Re-check after acquiring exclusive lock */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+
+		/* Check if another backend already extended the index */
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Another backend added a new page to the index; retry */
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+			page = BufferGetPage(buffer);
+
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, &itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta-page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+
+			MarkBufferDirty(metaBuffer);
+			MarkBufferDirty(buffer);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc
+stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta-page without any heap scans.
+ */
+IndexBuildResult *
+stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("Building STIR indexes is not supported")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *
+stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/*
+	 * For normal VACUUM, mark the index to skip inserts and warn that it
+	 * needs to be dropped.  In practice this path is not reachable during
+	 * CREATE INDEX CONCURRENTLY because the table-level locks held by CIC
+	 * prevent concurrent VACUUM from opening the auxiliary index.  It can only
+	 * be reached if a leftover STIR index somehow survives after a failed CIC
+	 * and a later VACUUM encounters it.
+	 */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages.  We don't need to worry about concurrently
+	 * added pages: by this point the index is marked as not ready, so it
+	 * no longer receives inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point(false);
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void
+StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+
+	Assert(!RelationNeedsWAL(index));
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	metaPage = BufferGetPage(metaBuffer);
+	metaData = StirPageGetMeta(metaPage);
+
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		MarkBufferDirty(metaBuffer);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * As with stirbulkdelete, this is not reachable during a normal CIC due to
+ * table-level locking.  It serves as a safety net for leftover STIR indexes
+ * from failed concurrent index builds.
+ */
+IndexBulkDeleteResult *
+stirvacuumcleanup(IndexVacuumInfo *info,
+				  IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented, seems like this index needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *
+stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void
+stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9407c357f27..cc067e58d36 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3432,6 +3432,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index 4aa52a4bd25..d7ea86b2805 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -314,6 +314,7 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 	indexInfo->ii_ParallelWorkers = 0;
 	indexInfo->ii_Am = BTREE_AM_OID;
 	indexInfo->ii_AmCache = NULL;
+	indexInfo->ii_Auxiliary = false;
 	indexInfo->ii_Context = CurrentMemoryContext;
 
 	collationIds[0] = InvalidOid;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 020a5919b84..e82a6926e8c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -731,6 +731,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 979c2be4abd..9db6b17abdc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -1092,6 +1092,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 3cd35c5c457..5359dab1176 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -875,6 +875,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 68bfe405db3..e4a666b2f72 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -58,6 +58,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index e8cb7f7a627..7f3f08a70ac 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..b08cf4d4ef0
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef STIR_H
+#define STIR_H
+
+#include "access/amapi.h"
+#include "nodes/pathnodes.h"
+#include "storage/bufpage.h"
+
+/* Support procedure numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((char *)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Reserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on the page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magicNumber;
+	BlockNumber	lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts? */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGIC_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif			/* STIR_H */
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index 46d361047fe..8bd2c2b46ba 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index df170b80840..a3457e749db 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -492,4 +492,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index 7a027c4810e..6ffc20a061c 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -308,5 +308,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fa9ae79082b..8f701faf6dd 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 13359180d25..3eaeed3c141 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -169,8 +169,8 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -230,7 +230,8 @@ typedef struct IndexInfo
 	bool		ii_WithoutOverlaps;
 	/* # of workers requested (excludes leader) */
 	int			ii_ParallelWorkers;
-
+	/* is this an auxiliary index for a concurrent index build? */
+	bool		ii_Auxiliary;
 	/* Oid of index AM */
 	Oid			ii_Am;
 	/* private cache area for index AM */
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index 74793a1a19d..bf0e30dabe9 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index cfdc6b1a17a..cc947194aa7 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2131,9 +2131,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index c8f3932edf0..ecc2c2a6049 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5171,7 +5171,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5185,7 +5186,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5210,9 +5212,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5221,12 +5223,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5235,7 +5238,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0



  [application/octet-stream] v35-0001-Add-stress-tests-for-concurrent-index-builds.patch (12.6K, 6-v35-0001-Add-stress-tests-for-concurrent-index-builds.patch)
  download | inline diff:
From 3b08777401347224a791b05b4c9c566fe98757f5 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v35 1/7] Add stress tests for concurrent index builds

Introduce stress tests for concurrent index operations:
- test concurrent inserts/updates during CREATE/REINDEX INDEX CONCURRENTLY
- cover various index types (btree, gin, gist, brin, hash, spgist)
- test unique and non-unique indexes
- test with expressions and predicates
- test both parallel and non-parallel operations
- test both read-committed and repeatable-read isolation levels

These tests verify the behavior of the following commits.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 293 ++++++++++++++++++++++++++++++++
 2 files changed, 294 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 592cef74ecb..51a62dccb7b 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..dd7a1eff0ef
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,293 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use constant STRESS_PGBENCH_CLIENTS => 30;
+use constant STRESS_PGBENCH_JOBS => 8;
+use constant STRESS_PGBENCH_TRANSACTIONS => 10000;
+use constant STRESS_MAX_SLEEP_MS => 10;
+
+use constant DEFAULT_PGBENCH_CLIENTS => 15;
+use constant DEFAULT_PGBENCH_JOBS => 4;
+use constant DEFAULT_PGBENCH_TRANSACTIONS => 500;
+use constant DEFAULT_MAX_SLEEP_MS => 1;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my $node;
+my $pg_test_extra = $ENV{PG_TEST_EXTRA} // '';
+my $is_stress = $pg_test_extra =~ /\bstress\b/ ? 1 : 0;
+my $pgbench_clients =
+  $is_stress ? STRESS_PGBENCH_CLIENTS : DEFAULT_PGBENCH_CLIENTS;
+my $pgbench_jobs = $is_stress ? STRESS_PGBENCH_JOBS : DEFAULT_PGBENCH_JOBS;
+my $pgbench_transactions =
+  $is_stress ? STRESS_PGBENCH_TRANSACTIONS : DEFAULT_PGBENCH_TRANSACTIONS;
+my $max_sleep_ms = $is_stress ? STRESS_MAX_SLEEP_MS : DEFAULT_MAX_SLEEP_MS;
+my $pgbench_options = sprintf(
+	'--no-vacuum --client=%d --jobs=%d --exit-on-abort --transactions=%d',
+	$pgbench_clients,
+	$pgbench_jobs,
+	$pgbench_transactions);
+my $no_hot = $is_stress ? int(rand(2)) : 0;
+
+print(
+		sprintf(
+		'settings: PG_TEST_EXTRA=%s stress=%d clients=%d jobs=%d transactions=%d max_sleep_ms=%d no_hot=%d',
+		defined($ENV{PG_TEST_EXTRA})
+		? ($pg_test_extra eq '' ? '(empty)' : $pg_test_extra)
+		: '(undef)',
+		$is_stress,
+		$pgbench_clients,
+		$pgbench_jobs,
+		$pgbench_transactions,
+		$max_sleep_ms,
+		$no_hot));
+print "\n";
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->append_conf('postgresql.conf', 'maintenance_work_mem = 32MB'); # to avoid OOM
+$node->append_conf('postgresql.conf', 'shared_buffers = 32MB'); # to avoid OOM
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE UNLOGGED TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp,
+								ia int4[], p point)));
+
+if ($no_hot) { $node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);)); }
+
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => sprintf(q(
+			SET debug_parallel_query = off; -- needed because of how predicate_stable is implemented
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY new_idx ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for unique BTREE',
+	{
+		'concurrent_ops_unique_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					CREATE UNIQUE INDEX CONCURRENTLY new_idx ON tbl(i);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT bt_index_check('new_idx', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIN with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIN',
+	{
+		'concurrent_ops_gin_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIN (ia);
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					SELECT gin_index_check('new_idx');
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for GIST/BRIN/HASH/SPGIST index concurrently with upserts
+$node->pgbench(
+	$pgbench_options,
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY for GIST/BRIN/HASH/SPGIST',
+	{
+		'concurrent_ops_other_idx' => sprintf(q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\set use_rr random(0, 9)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :use_rr = 0
+						SET default_transaction_isolation = 'repeatable read';
+					\endif
+					\set variant random(0, 3)
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING GIST (p);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING BRIN (updated_at);
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING HASH (updated_at);
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY new_idx ON tbl USING SPGIST (p);
+					\endif
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					REINDEX INDEX CONCURRENTLY new_idx;
+					\set sleep_ms random(0, %d)
+					\sleep :sleep_ms ms
+					DROP INDEX CONCURRENTLY new_idx;
+					RESET default_transaction_isolation;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now(),ARRAY[floor(random()*100)::int],point(random(),random()))
+					ON CONFLICT(i) DO UPDATE SET updated_at = now(), ia = ARRAY[floor(random()*100)::int], p = point(random(),random());
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+			), $max_sleep_ms, $max_sleep_ms)
+		});
+
+$node->stop;
+done_testing();
-- 
2.43.0



  [application/octet-stream] v35-0006-Optimize-auxiliary-index-handling.patch (3.0K, 7-v35-0006-Optimize-auxiliary-index-handling.patch)
  download | inline diff:
From fe5dc728bf0b00388d244a8d6141b57299d9b327 Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 30 Dec 2024 16:37:12 +0100
Subject: [PATCH v35 6/7] Optimize auxiliary index handling

Skip unnecessary computations for auxiliary indexes:
- in the index-insert path, detect auxiliary indexes and bypass Datum value computation
- pass indexUnchanged=false for auxiliary indexes to avoid redundant checks

These optimizations reduce overhead during concurrent index operations.
---
 src/backend/catalog/index.c         | 9 +++++++++
 src/backend/executor/execIndexing.c | 5 ++++-
 src/include/nodes/execnodes.h       | 6 ++++--
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 9136dfc7c73..4edf68aced2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2940,6 +2940,15 @@ FormIndexDatum(IndexInfo *indexInfo,
 	ListCell   *indexpr_item;
 	int			i;
 
+	/* Auxiliary index does not need any values to be computed */
+	if (unlikely(indexInfo->ii_Auxiliary))
+	{
+		Assert(indexInfo->ii_Am == STIR_AM_OID);
+		memset(values, 0, sizeof(Datum) * indexInfo->ii_NumIndexAttrs);
+		memset(isnull, true, sizeof(bool) * indexInfo->ii_NumIndexAttrs);
+		return;
+	}
+
 	if (indexInfo->ii_Expressions != NIL &&
 		indexInfo->ii_ExpressionsState == NIL)
 	{
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index eb383812901..2df77a0606d 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -439,8 +439,11 @@ ExecInsertIndexTuples(ResultRelInfo *resultRelInfo,
 		 * There's definitely going to be an index_insert() call for this
 		 * index.  If we're being called as part of an UPDATE statement,
 		 * consider if the 'indexUnchanged' = true hint should be passed.
+		 *
+		 * For auxiliary indexes, always pass false to skip value comparison checks,
+		 * since auxiliary indexes only store TIDs and don't track value changes.
 		 */
-		indexUnchanged = ((flags & EIIT_IS_UPDATE) &&
+		indexUnchanged = ((flags & EIIT_IS_UPDATE) && !indexInfo->ii_Auxiliary &&
 						  index_unchanged_by_update(resultRelInfo,
 													estate,
 													indexInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index af58d4cf4b5..99916e150fd 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -169,8 +169,10 @@ typedef struct ExprState
  *		entries for a particular index.  Used for both index_build and
  *		retail creation of index entries.
  *
- * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
- * are used only during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
+ * during index build; they're conventionally zeroed otherwise.  ii_Auxiliary
+ * is also used during retail inserts to skip datum formation for auxiliary
+ * indexes.
  * ----------------
  */
 typedef struct IndexInfo
-- 
2.43.0



  [application/octet-stream] v35-0007-Refresh-snapshot-periodically-during-index-valid.patch (27.1K, 8-v35-0007-Refresh-snapshot-periodically-during-index-valid.patch)
  download | inline diff:
From 5a7eea41aa335b3592f83b6f8fb701e357c47e6b Mon Sep 17 00:00:00 2001
From: Mikhail Nikalayeu <[email protected]>
Date: Mon, 21 Apr 2025 14:11:53 +0200
Subject: [PATCH v35 7/7] Refresh snapshot periodically during index validation

Enhance the validation phase of concurrently built indexes by periodically refreshing the snapshot rather than using a single reference snapshot. This addresses issues with xmin propagation during long-running validations.

The validation now takes a fresh snapshot every few pages, allowing the xmin horizon to advance. This restores the feature of commit d9d076222f5b, which was reverted in commit e28bb8851969. The new STIR-based approach no longer depends on a single reference snapshot.
---
 src/backend/access/heap/README.HOT         |  4 +-
 src/backend/access/heap/heapam_handler.c   | 77 +++++++++++++++++++++-
 src/backend/access/spgist/spgvacuum.c      | 12 +++-
 src/backend/catalog/index.c                | 73 +++++++++++++++-----
 src/backend/commands/indexcmds.c           | 52 +++------------
 src/backend/utils/misc/guc_parameters.dat  |  9 +++
 src/include/access/tableam.h               | 25 ++++---
 src/include/access/transam.h               | 15 +++++
 src/include/catalog/index.h                |  2 +-
 src/include/miscadmin.h                    |  1 +
 src/test/regress/expected/create_index.out |  3 +
 src/test/regress/sql/create_index.sql      |  4 ++
 12 files changed, 194 insertions(+), 83 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index b1c797517ee..382fe1723a5 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -401,12 +401,12 @@ live tuple.
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then validate
 the index.  This searches for tuples missing from the index in auxiliary
-index, and inserts any missing ones if they are visible to reference snapshot.
+index, and inserts any missing ones if they are visible to a fresh snapshot.
 Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8cbc4855078..65731224111 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -53,6 +53,9 @@
 /* GUC: percentage of maintenance_work_mem for CIC validation tuplestore */
 int			debug_cic_validate_store_mem_pct = 10;
 
+/* GUC: refresh snapshot every N pages during CIC validation (0 = disable) */
+int			debug_cic_validate_snapshot_pages = 4096;
+
 static void reform_and_rewrite_tuple(HeapTuple tuple,
 									 Relation OldHeap, Relation NewHeap,
 									 Datum *values, bool *isnull, RewriteState rwstate);
@@ -1971,24 +1974,35 @@ heapam_index_validate_scan_read_stream_next(
 	return result;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
 						   ValidateIndexState *state,
 						   ValidateIndexState *auxState)
 {
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
 
+	Snapshot		snapshot;
 	TupleTableSlot  *slot;
 	EState			*estate;
 	ExprContext		*econtext;
 	BufferAccessStrategy bstrategy = GetAccessStrategy(BAS_BULKREAD);
 
 	int64			num_to_check;
+	int64			page_read_counter = 1; /* set to 1 to skip snapshot reset at start */
 	Tuplestorestate *tuples_for_check;
+
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
 	ValidateIndexScanState callback_private_data;
 
 	Buffer buf;
@@ -1998,6 +2012,8 @@ heapam_index_validate_scan(Relation heapRelation,
 	/* Use a percentage of maintenance_work_mem for tuple store. */
 	int		store_work_mem_part = maintenance_work_mem * debug_cic_validate_store_mem_pct / 100;
 
+	PushActiveSnapshot(GetTransactionSnapshot());
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
@@ -2006,6 +2022,12 @@ heapam_index_validate_scan(Relation heapRelation,
 	 */
 	tuples_for_check = tuplestore_begin_datum(INT8OID, false, false, store_work_mem_part);
 
+	PopActiveSnapshot();
+	InvalidateCatalogSnapshot();
+
+	Assert(!reset_snapshot || !HaveRegisteredOrActiveSnapshot());
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+
 	/*
 	 * sanity checks
 	 */
@@ -2021,6 +2043,29 @@ heapam_index_validate_scan(Relation heapRelation,
 
 	state->tuplesort = auxState->tuplesort = NULL;
 
+	/*
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples.  We are going to replace it with a newer snapshot every so
+	 * often to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid; limitXmin is tracked for that purpose.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
+
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
@@ -2054,6 +2099,7 @@ heapam_index_validate_scan(Relation heapRelation,
 
 		LockBuffer(buf, BUFFER_LOCK_SHARE);
 		block_number = BufferGetBlockNumber(buf);
+		page_read_counter++;
 
 		i = 0;
 		while ((off = tuples[i]) != InvalidOffsetNumber)
@@ -2124,6 +2170,21 @@ heapam_index_validate_scan(Relation heapRelation,
 		}
 
 		ReleaseBuffer(buf);
+		if (reset_snapshot &&
+			debug_cic_validate_snapshot_pages > 0 &&
+			page_read_counter % debug_cic_validate_snapshot_pages == 0)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* Advance limitXmin so we wait for all snapshots seen so far */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+		}
 	}
 
 	ExecDropSingleTupleTableSlot(slot);
@@ -2133,11 +2194,23 @@ heapam_index_validate_scan(Relation heapRelation,
 	read_stream_end(read_stream);
 	tuplestore_end(tuples_for_check);
 
+	/*
+	 * Drop the latest snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || MyProc->xmin == InvalidTransactionId);
 	FreeAccessStrategy(bstrategy);
 
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index c461f8dc02d..ef192fb99c2 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -191,14 +191,16 @@ vacuumLeafPage(spgBulkDeleteState *bds, Relation index, Buffer buffer,
 			 * Add target TID to pending list if the redirection could have
 			 * happened since VACUUM started.  (If xid is invalid, assume it
 			 * must have happened before VACUUM started, since REINDEX
-			 * CONCURRENTLY locks out VACUUM.)
+			 * CONCURRENTLY locks out VACUUM; if myXmin is invalid, this is a
+			 * validation scan.)
 			 *
 			 * Note: we could make a tighter test by seeing if the xid is
 			 * "running" according to the active snapshot; but snapmgr.c
 			 * doesn't currently export a suitable API, and it's not entirely
 			 * clear that a tighter test is worth the cycles anyway.
 			 */
-			if (TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
+			if (!TransactionIdIsValid(bds->myXmin) ||
+					TransactionIdFollowsOrEquals(dt->xid, bds->myXmin))
 				spgAddPendingTID(bds, &dt->pointer);
 		}
 		else
@@ -808,7 +810,6 @@ spgvacuumscan(spgBulkDeleteState *bds)
 	/* Finish setting up spgBulkDeleteState */
 	initSpGistState(&bds->spgstate, index);
 	bds->pendingList = NULL;
-	bds->myXmin = GetActiveSnapshot()->xmin;
 	bds->lastFilledBlock = SPGIST_LAST_FIXED_BLKNO;
 
 	/*
@@ -959,6 +960,10 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bds.stats = stats;
 	bds.callback = callback;
 	bds.callback_state = callback_state;
+	if (info->validate_index)
+		bds.myXmin = InvalidTransactionId;
+	else
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 	spgvacuumscan(&bds);
 
@@ -999,6 +1004,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.stats = stats;
 		bds.callback = dummy_callback;
 		bds.callback_state = NULL;
+		bds.myXmin = GetActiveSnapshot()->xmin;
 
 		spgvacuumscan(&bds);
 	}
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 4edf68aced2..49adcb152cf 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -69,6 +69,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -3538,8 +3539,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At the same moment we clear "indisready" for
  * auxiliary index, since it is no more required to be updated.
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot; any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index.  In
+ * order to propagate xmin, we replace that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3552,7 +3554,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
@@ -3565,21 +3567,24 @@ IndexCheckExclusion(Relation heapRelation,
  * before it declares a uniqueness error.
  *
  * After completing validate_index(), we wait until all transactions that
- * were alive at the time of the reference snapshot are gone; this is
- * necessary to be sure there are none left with a transaction snapshot
- * older than the reference (and hence possibly able to see tuples we did
- * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
+ * were alive at the time of the latest snapshot used during validation are
+ * gone; this is necessary to be sure there are none left with a transaction
+ * snapshot older than that (and hence possibly able to see tuples we did
+ * not index).  The snapshot is periodically refreshed during the heap scan
+ * to propagate the xmin horizon, so limitXmin tracks the most recent one.
+ * Then we mark the index "indisvalid" and commit.  Subsequent transactions
+ * will be able to use it for queries.
  *
  * Also, some actions to concurrent drop the auxiliary index are performed.
  */
-void
-validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
 				indexRelation,
 				auxIndexRelation;
 	IndexInfo  *indexInfo;
+	TransactionId limitXmin;
 	IndexVacuumInfo ivinfo, auxivinfo;
 	ValidateIndexState state, auxState;
 	Oid			save_userid;
@@ -3592,6 +3597,16 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	int			main_work_mem_part = (int)((int64) maintenance_work_mem * 8 / 10);
 	int			aux_work_mem_part = maintenance_work_mem / 10;
 
+	/*
+	 * Under REPEATABLE READ or SERIALIZABLE (possible via
+	 * default_transaction_isolation), GetLatestSnapshot() returns the
+	 * transaction-level snapshot and xmin stays pinned.  Periodic snapshot
+	 * refresh is pointless in that case, so skip it.
+	 */
+#ifdef USE_ASSERT_CHECKING
+	bool		reset_snapshot = XactIsoLevel <= XACT_READ_COMMITTED;
+#endif
+
 	{
 		const int	progress_index[] = {
 			PROGRESS_CREATEIDX_PHASE,
@@ -3629,8 +3644,12 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3666,6 +3685,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 										   NULL, TUPLESORT_NONE);
 	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	(void) index_bulk_delete(&auxivinfo, NULL,
 							 validate_index_callback, &auxState);
 	/* If aux index is empty, merge may be skipped */
@@ -3685,7 +3707,13 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		index_close(indexRelation, NoLock);
 		table_close(heapRelation, NoLock);
 
-		return;
+		PushActiveSnapshot(GetTransactionSnapshot());
+		limitXmin = GetActiveSnapshot()->xmin;
+		PopActiveSnapshot();
+		InvalidateCatalogSnapshot();
+
+		Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+		return limitXmin;
 	}
 
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
@@ -3694,6 +3722,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
+	/* tuplesort_begin_datum may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	/* ambulkdelete updates progress metrics */
 	(void) index_bulk_delete(&ivinfo, NULL,
 							 validate_index_callback, &state);
@@ -3713,19 +3744,24 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+
 	tuplesort_performsort(auxState.tuplesort);
+	/* tuplesort_performsort may require catalog snapshot */
+	InvalidateCatalogSnapshot();
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
 
 	/*
 	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state,
-							  &auxState);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
 	/* Tuple sort closed by table_index_validate_scan */
 	Assert(state.tuplesort == NULL && auxState.tuplesort == NULL);
@@ -3748,6 +3784,9 @@ validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot)
 	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!reset_snapshot || !TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 46c4ccc6789..a700068f8a2 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -596,7 +596,6 @@ DefineIndex(ParseState *pstate,
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -1816,32 +1815,11 @@ DefineIndex(ParseState *pstate,
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
-
-	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
-	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
 	/*
 	 * Merge content of auxiliary and target indexes - insert any missing index entries.
 	 */
-	validate_index(tableId, indexRelationId, auxIndexRelationId, snapshot);
-	limitXmin = snapshot->xmin;
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
-	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
 	 * are holding back our process's advertised xmin; in particular, if
@@ -1863,8 +1841,8 @@ DefineIndex(ParseState *pstate,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define-index-before-set-valid", NULL);
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
@@ -4433,7 +4411,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4448,13 +4425,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4466,16 +4436,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[3] = newidx->amId;
 		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4485,10 +4446,13 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		CommitTransactionCommand();
 		StartTransactionCommand();
 
+		/* We should now definitely not be advertising any xmin. */
+		Assert(!TransactionIdIsValid(MyProc->xmin));
+
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3477866d729..6b5b130f2aa 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -640,6 +640,15 @@
   boot_val => 'DEFAULT_ASSERT_ENABLED',
 },
 
+{ name => 'debug_cic_validate_snapshot_pages', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Refresh snapshot every N pages during CIC validation (0 to disable).',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'debug_cic_validate_snapshot_pages',
+  boot_val => '4096',
+  min => '0',
+  max => '1000000',
+},
+
 { name => 'debug_cic_validate_store_mem_pct', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
   short_desc => 'Percentage of maintenance_work_mem used for CIC validation tuplestore.',
   flags => 'GUC_NOT_IN_SAMPLE',
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index da3598663bc..9298f68d18a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -739,12 +739,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										IndexInfo *index_info,
-										Snapshot snapshot,
-										ValidateIndexState *state,
-										ValidateIndexState *aux_state);
+	TransactionId		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												IndexInfo *index_info,
+												ValidateIndexState *state,
+												ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1911,20 +1910,18 @@ table_index_build_range_scan(Relation table_rel,
  * Note: it is responsibility of that function to close sortstates in
  * both `state` and `auxstate`.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  IndexInfo *index_info,
-						  Snapshot snapshot,
 						  ValidateIndexState *state,
 						  ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state,
-											   auxstate);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 55a4ab26b34..923aadbab43 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -415,6 +415,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 3239e5c716f..def7352a859 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -159,7 +159,7 @@ extern void index_build(Relation heapRelation,
 						bool parallel,
 						bool progress);
 
-extern void validate_index(Oid heapId, Oid indexId, Oid auxIndexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 8c2b3a9c5e7..2ad3deff9cd 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -272,6 +272,7 @@ extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT double hash_mem_multiplier;
 extern PGDLLIMPORT int maintenance_work_mem;
 extern PGDLLIMPORT int debug_cic_validate_store_mem_pct;
+extern PGDLLIMPORT int debug_cic_validate_snapshot_pages;
 extern PGDLLIMPORT int max_parallel_maintenance_workers;
 
 /*
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 2d6abb15a89..758c5884ff5 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -3382,6 +3382,9 @@ DROP INDEX aux_index_ind6;
 --------+---------+-----------+----------+---------
  c1     | integer |           |          | 
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
 DROP TABLE aux_index_tab5;
 -- Check handling of indexes with expressions and predicates.  The
 -- definitions of the rebuilt indexes should match the original
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index fd96d80abbc..65dd58b947d 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -1400,6 +1400,10 @@ DROP INDEX aux_index_ind6;
 -- Make sure auxiliary index dropped too
 \d aux_index_tab5
 
+SET default_transaction_isolation = 'repeatable read';
+CREATE INDEX CONCURRENTLY aux_index_ind6 ON aux_index_tab5 (c1);
+SET default_transaction_isolation = 'read committed';
+
 DROP TABLE aux_index_tab5;
 
 -- Check handling of indexes with expressions and predicates.  The
-- 
2.43.0
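
The TransactionIdNewer() helper added above is what lets limitXmin follow the most recent snapshot across refreshes. A minimal standalone sketch of that bookkeeping follows. It assumes the usual modulo-2^32 comparison for normal XIDs; every Demo/demo_ name is hypothetical and only imitates the shape of the helper so the example compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for TransactionId and its special values. */
typedef uint32_t DemoXid;

#define DEMO_INVALID_XID		((DemoXid) 0)
#define DEMO_FIRST_NORMAL_XID	((DemoXid) 3)

/* True if a is logically later than b, allowing for XID wraparound. */
bool
demo_xid_follows(DemoXid a, DemoXid b)
{
	/* Special (non-normal) XIDs are compared as plain unsigned values. */
	if (a < DEMO_FIRST_NORMAL_XID || b < DEMO_FIRST_NORMAL_XID)
		return a > b;

	/* Normal XIDs: the sign of the 32-bit difference gives the order. */
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs; an invalid XID loses to anything. */
DemoXid
demo_xid_newer(DemoXid a, DemoXid b)
{
	if (a == DEMO_INVALID_XID)
		return b;
	if (b == DEMO_INVALID_XID)
		return a;
	return demo_xid_follows(a, b) ? a : b;
}

/* Fold the xmin of each refreshed snapshot into one limit value. */
DemoXid
demo_limit_xmin(const DemoXid *snapshot_xmins, size_t n)
{
	DemoXid		limit = DEMO_INVALID_XID;

	for (size_t i = 0; i < n; i++)
		limit = demo_xid_newer(limit, snapshot_xmins[i]);
	return limit;
}
```

The wraparound-aware comparison matters here because a long validation scan may take snapshots on both sides of the 2^32 boundary; a plain unsigned maximum would then pick the wrong xmin.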




* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 17:58                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-11-28 18:05                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-01 10:29                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-01 10:49                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 07:28                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:27                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-12-02 11:12                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2026-03-09 00:09                                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-23 22:08                                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-28 19:17                                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-03-31 22:11                                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-04-06 18:21                                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2026-04-07 01:42                                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Josh Kupershmidt <[email protected]>
  2026-04-07 23:19                                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2026-04-13 01:05                                                                               ` Josh Kupershmidt <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Josh Kupershmidt @ 2026-04-13 01:05 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Antonin Houska <[email protected]>; Hannu Krosing <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Tue, Apr 7, 2026 at 7:20 PM Mihail Nikalayeu <[email protected]>
wrote:

> Hello, Josh!
>
> Your review looks a bit LLM-generated, but anyway - thanks for review! :)
> Especially because at least one point seems to be valid.
>
> > We're leaving behind two invalid indexes now that the user has to figure
> > out how to drop in case of an error - that seems like it could be
> > confusing for the user.
> > Could we have some better way (error handler, background worker) try to
> > perform this cleanup automatically?
> > If not, we should at least tell the user clearly in the error message
> > that both invalid indexes are left behind (i.e. "idx" and "idx_ccaux"
> > in the example)
>
> Commit 0005 adds automatic dropping of auxiliary indexes when the
> original index is reindexed or dropped. Also, documentation reflects
> the ccaux index (similar to ccnew).
>

Well, we auto-drop the aux index if the user figures out that they should
drop the main index first.


> > Docs are inconsistent or confusing about whether there's one or two
> > indexes left behind in case of error
> > - e.g. "command will fail but leave behind *an* invalid index and its
> > associated auxiliary index"
> > somewhat implying there is only one invalid index, and somehow the
> > auxiliary index is valid?
>
> Auxiliary index is never marked as valid; I'm not sure we need to
> highlight it here. Or do you have an idea how to rephrase?
>

For example in this doc change hunk:

@@ -664,12 +665,19 @@ postgres=# \d tab
  col    | integer |           |          |
 Indexes:
     "idx" btree (col) INVALID
+    "idx_ccaux" stir (col) INVALID
 </programlisting>

-    The recommended recovery
-    method in such cases is to drop the index and try again to perform
-    <command>CREATE INDEX CONCURRENTLY</command>.  (Another possibility is
-    to rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>).
+    The recommended recovery method in such cases is to drop the index with
+    <command>DROP INDEX</command>. The auxiliary index (suffixed with
+    <literal>_ccaux</literal>) will be automatically dropped when the main
+    index is dropped. After dropping the indexes, you can try again to perform
+    <command>CREATE INDEX CONCURRENTLY</command>. (Another possibility is to
+    rebuild the index with <command>REINDEX INDEX CONCURRENTLY</command>,
+    which will also handle cleanup of any invalid auxiliary indexes.)
+    If the only invalid index is one suffixed <literal>_ccaux</literal>,
+    the recommended recovery method is just <literal>DROP INDEX</literal>
+    for that index.
    </para>

The output we're showing the user from psql is two INVALID indexes, and
we're keeping the original doc suggestion on the first line that "The
recommended recovery method in such cases is to drop the index with DROP
INDEX". The next sentence clarifies a bit that there's an auxiliary index
that "will be automatically dropped". But now it's on the user to figure
out which index is which, and drop the right one.


> > Similarly, when the doc mentions e.g. "drop the index" - it's not
> > necessarily clear which index
> > we're talking about when there are two invalid indexes left behind that
> > the user will see with `\d`
>
> In one commit it says: "method in such cases is to drop these indexes
> and try again to perform".
> After 0005 "The auxiliary index (suffixed with
> <literal>_ccaux</literal>) will be automatically dropped when the main
> index is dropped".
> It seems clear to me, but feel free to provide your variant.
>
> >  * It would be nice to guard against users trying arbitrary CREATE INDEX
> > ... USING stir(...) with a clear error
>
> It will fail with "Building STIR indexes is not supported".
>

Sorry, you are right, this is handled with a good error.


>
> > One of the testcases (line 2478 of patch 0004) does `DELETE FROM
> > concur_reindex_tab4 WHERE c1 = 1;`
> > but the table `concur_reindex_tab4` looks like it has been dropped a few
> > lines above that?
>
> Hm, yep, I'll fix it.
>
> > The StirPageGetFreeSpace macro from patch 0002 reads
> > `StirPageGetMaxOffset(page)`
> > which seems like it could cause an unsafe read of opaque->maxoff if used
> > on the metapage
>
> But it was never used for the metapage.
>

Yes, but I think it'd be better not to leave a possible foot-gun around for
the next developer. Even just adding an AssertMacro like:

#define StirPageGetMaxOffset(page) \
	(AssertMacro(!StirPageIsMeta(page)), StirPageGetOpaque(page)->maxoff)
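
As a standalone illustration of that pattern (not code from the patch), here is a minimal sketch of a comma-operator assertion inside an accessor macro. All Demo/DEMO_ names are hypothetical, and DEMO_ASSERT_MACRO only imitates the general shape of AssertMacro so that the example compiles outside the PostgreSQL tree:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the STIR opaque area; not the patch's struct. */
typedef struct DemoOpaque
{
	uint16_t	maxoff;			/* number of tuples on a data page */
	uint16_t	flags;			/* DEMO_META marks the metapage */
} DemoOpaque;

#define DEMO_META	0x0001

/* Report a failed check and abort, as an assertion failure would. */
static void
demo_assert_failed(const char *expr, const char *file, int line)
{
	fprintf(stderr, "assertion \"%s\" failed at %s:%d\n", expr, file, line);
	abort();
}

/*
 * Expression-style assertion: it evaluates to void, so it can sit on the
 * left side of a comma operator inside another macro.
 */
#define DEMO_ASSERT_MACRO(cond) \
	((void) ((cond) || (demo_assert_failed(#cond, __FILE__, __LINE__), 0)))

#define DemoPageIsMeta(opaque)	(((opaque)->flags & DEMO_META) != 0)

/* The check runs first; the value of the whole macro is still maxoff. */
#define DemoPageGetMaxOffset(opaque) \
	(DEMO_ASSERT_MACRO(!DemoPageIsMeta(opaque)), (opaque)->maxoff)
```

With this shape the accessor can still be used as an expression, but calling it on a metapage aborts instead of silently reading an unrelated field.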

There are some other asserts for some of the trickier bookkeeping that
happens in this patch that I think would help check the code, and make it
easier to understand as well. E.g. adding an assertion check at the end of
StirPageAddItem(), and inside stirbulkdelete() (I tried calling it around
L499 of stir.c, just before 'while (itup < itupEnd)').

/*
 * Validate that maxoff and pd_lower are consistent on a STIR data page.
 *
 * On a freshly initialized empty page, pd_lower is SizeOfPageHeaderData
 * (set by PageInit).  After the first insert, pd_lower is computed from
 * PageGetContents which uses MAXALIGN(SizeOfPageHeaderData).
 */
static inline void
StirPageValidate(Page page)
{
	Assert(!StirPageIsMeta(page));
	Assert(StirPageGetOpaque(page)->maxoff == 0
		   ? ((PageHeader) page)->pd_lower == SizeOfPageHeaderData
		   : ((PageHeader) page)->pd_lower ==
		   MAXALIGN(SizeOfPageHeaderData) +
		   StirPageGetOpaque(page)->maxoff * sizeof(StirTuple));
}


>
> > A comment explains "No predicate evaluation is needed here", i.e. we
> > are skipping predicate
> > evaluation in the validation scan step, assuming that the
> > auxiliary index contains only qualifying TIDs. Is this really
> > bulletproof for e.g. partial indexes which may
> > no longer satisfy the predicate at the time of the validation scan due
> > to conflicting HOT updates?
>
> Conflicting HOT updates are not possible because the catalog contains
> the new index definition from the start of the process.
> Or do you mean a different scenario?
>

Sorry for the false alarm, I believe you are right - I had to double
check RelationGetIndexAttrBitmap(), but I believe this is safe based on
the hotblockingattrs bitmap.
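
To spell out the reasoning, here is a toy model of that check. It is only a sketch under my reading of the code: I am assuming the HOT-blocking set is the union of key, expression and predicate columns of every live index, including ones that are not yet valid. The types and function names are hypothetical and much simpler than the real Bitmapset-based code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy attribute set: bit n set means column number n is a member. */
typedef uint64_t DemoAttrSet;

#define DEMO_ATTR(n)	((DemoAttrSet) 1 << (n))

/* Hypothetical, simplified view of one index's catalog state. */
typedef struct DemoIndex
{
	DemoAttrSet key_attrs;		/* columns used by keys and expressions */
	DemoAttrSet pred_attrs;		/* columns used by a partial-index predicate */
	bool		is_live;		/* catalog entry exists and is not being dropped */
	bool		is_valid;		/* usable for queries; ignored on purpose below */
} DemoIndex;

/* Union of the columns of every live index, valid or not. */
DemoAttrSet
demo_hot_blocking_attrs(const DemoIndex *indexes, size_t nindexes)
{
	DemoAttrSet blocking = 0;

	for (size_t i = 0; i < nindexes; i++)
	{
		if (!indexes[i].is_live)
			continue;
		blocking |= indexes[i].key_attrs | indexes[i].pred_attrs;
	}
	return blocking;
}

/* An update may be HOT only if it changes none of the blocking columns. */
bool
demo_update_can_be_hot(DemoAttrSet blocking, DemoAttrSet modified)
{
	return (blocking & modified) == 0;
}
```

In this model, an update that touches a column used only by the predicate of an index still being built is already forced to be non-HOT, which is why the validation scan does not have to re-check the predicate.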

Overall, this is a nice improvement for CIC/RC that I think should help
particularly on large, busy systems.

Thanks,
Josh



* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-28 18:31                                                     ` Matthias van de Meent <[email protected]>
  2025-11-28 19:08                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  1 sibling, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-28 18:31 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, 28 Nov 2025 at 18:58, Hannu Krosing <[email protected]> wrote:
>
> On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> <[email protected]> wrote:
> >
> ...
> > I'm a bit worried, though, that LR may lose updates due to commit
> > order differences between WAL and PGPROC. I don't know how that's
> > handled in logical decoding, and can't find much literature about it
> > in the repo either.
>
> Now the reference to logical decoding made me think that maybe the real
> fix for CIC would be to leverage logical decoding for the 2nd pass of
> CIC and not worry about in-page visibilities at all.

-1: Requiring the logical decoding system just to reindex an index
without O(tablesize) lock time adds too much overhead, and removes
features we currently have (CIC on unlogged tables). wal_level=logical
*must not* be required for these tasks if we can at all avoid it.
I'm also not sure whether logical decoding gets access to the HOT
information of the updated tuples involved, and therefore whether the
index build can determine whether it must or can't insert the tuple.

I don't think logical decoding is sufficient, because we don't know
which tuples were already inserted into the index by their own
backends, so we don't know which tuples' index entries we must skip.


Kind regards,

Matthias van de Meent.

PS. I think the same should be true for REPACK CONCURRENTLY, but
that's a new command with yet-to-be-determined semantics, unlike CIC
which has been part of PG for 6 years.






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Hannu Krosing @ 2025-11-28 19:08 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, Nov 28, 2025 at 7:31 PM Matthias van de Meent
<[email protected]> wrote:
>
> On Fri, 28 Nov 2025 at 18:58, Hannu Krosing <[email protected]> wrote:
> >
> > On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> > <[email protected]> wrote:
> > >
> > ...
> > > I'm a bit worried, though, that LR may lose updates due to commit
> > > order differences between WAL and PGPROC. I don't know how that's
> > > handled in logical decoding, and can't find much literature about it
> > > in the repo either.
> >
> > Now the reference to logical decoding made me think that maybe the real
> > fix for CIC would be to leverage logical decoding for the 2nd pass of
> > CIC and not worry about in-page visibilities at all.
>
> -1: Requiring the logical decoding system just to reindex an index
> without O(tablesize) lock time adds too much overhead, and removes
> features we currently have (CIC on unlogged tables). wal_level=logical
> *must not* be required for these tasks if we can at all avoid it.
> I'm also not sure whether logical decoding gets access to the HOT
> information of the updated tuples involved, and therefore whether the
> index build can determine whether it must or can't insert the tuple.

There are more and more cases (not just CIC here) where using logical
decoding would be the most efficient solution, so why not start
improving it instead of complicating the system in various places?

We could even start selectively logging UNLOGGED and TEMP tables when
we start CIC if CIC has enough upsides.

> I don't think logical decoding is sufficient, because we don't know
> which tuples were already inserted into the index by their own
> backends, so we don't know which tuples' index entries we must skip.

The premise of pass2 in CIC is that we collect all the rows that were
inserted after CIC started for which we are not 100% sure that they
are inserted in the index. We can only be sure they are inserted for
transactions started after pass1 completed and the index became
visible and available for inserts.

I am sure that it is possible to avoid inserting a duplicate entry (same
value and tid) at insert time.
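
As a rough illustration of that de-duplication idea, here is a toy Python model (not PostgreSQL code; the `ToyIndex` class and its `insert` method are made up for this sketch). It assumes an index that treats the pair (value, tid) as the identity of an entry, so replaying a change that a backend already inserted is a no-op. Whether a real index AM can make this check cheaply is a separate question.

```python
class ToyIndex:
    """Toy model: an index as a set of (value, tid) entries."""

    def __init__(self):
        self.entries = set()

    def insert(self, value, tid):
        """Insert (value, tid); return False if the entry already exists."""
        entry = (value, tid)
        if entry in self.entries:
            return False  # duplicate: same value and same tid, skip it
        self.entries.add(entry)
        return True


index = ToyIndex()
# A backend inserts the entry itself once the index is open for inserts.
index.insert("alice", (0, 1))
# Pass 2 later replays the collected changes; the repeat is ignored.
replayed = [("alice", (0, 1)), ("bob", (0, 2))]
added = [index.insert(value, tid) for value, tid in replayed]
```

Here `added` records which replayed entries were actually new to the index.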

And we do not care about HOT update chains during normal CREATE INDEX
or the first pass of CIC - we just index what is visible NOW with no regard
to whether the tuple is at the end of a HOT update chain.

> Kind regards,
>
> Matthias van de Meent.
>
> PS. I think the same should be true for REPACK CONCURRENTLY, but
> that's a new command with yet-to-be-determined semantics, unlike CIC
> which has been part of PG for 6 years.

CIC has been around way longer, since 8.2, released in 2006, so more
like 20 years :)

---
Cheers
Hannu






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Mihail Nikalayeu @ 2025-11-28 20:41 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Antonin Houska <[email protected]>

Hi, Hannu!

I think you pressed "Reply" instead of "Reply All" - so, I put it to
the list (looks like nothing is secret here).
Mostly it is because of my opinion at the end of the mail which I want
to share with the list.

On Fri, Nov 28, 2025 at 8:33 PM Hannu Krosing <[email protected]> wrote:
> If it is an *index AM* then this may not solve HOT chains issue (see
> below), if we put it on top of *table AM* as some kind of pass-through
> collector then likely yes, though you may still want to do final sort
> in commit order to know which one is the latest version of updated
> tuples which needs to go in the index. The latter is not strictly
> needed, but would be a nice optimisation for oft-updated rows.

It is an AM which is added as an index (with the same
columns/expressions/predicates) to the table before phase 1 starts.
So, all new tuples are inserted into it.

> And I would not collect just the TID, but also the indexed value, as otherwise
> we end up accessing the table in some random order to get the
> value (and possibly do visibility checks)
Just TIDs - they are ordered at the validation phase (while merging with the
main index) and read using AIO - pretty fast.
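
A toy sketch of that merge step, in Python rather than the patch's C code. The function name `tids_missing_from_main` is hypothetical, and TIDs are modeled as (block, offset) tuples. The assumption is that the collected TIDs are sorted, walked alongside the sorted TIDs already present in the main index, and only the ones the main index lacks are kept. Because the result is sorted, the heap would then be visited in block order instead of randomly.

```python
def tids_missing_from_main(aux_tids, main_tids):
    """Return sorted TIDs present in aux_tids but absent from main_tids."""
    aux = sorted(set(aux_tids))
    main = sorted(set(main_tids))
    missing = []
    i = 0
    for tid in aux:
        # Advance the main cursor past everything smaller than tid.
        while i < len(main) and main[i] < tid:
            i += 1
        if i == len(main) or main[i] != tid:
            missing.append(tid)
    return missing


aux = [(7, 2), (1, 4), (3, 1), (1, 4)]
main = [(1, 4), (2, 2), (9, 9)]
to_fetch = tids_missing_from_main(aux, main)
```

In this example `to_fetch` holds only the TIDs that still need a heap visit, already in block order.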

> I am not sure where we decide that a tuple is HOT-updatable, but I
> suspect that it is before we call any index AMs, so STIR is not
> guaranteed to solve the issues with HOT chains.

I am not sure what the HOT-chains issue is, but it actually works
correctly already, including stress tests.
It is even merged into one commercial fork of PG (I am not affiliated
with it in any way).

> (And yes, I have a patch in the works to include old and new tids as part
> of logical decoding - they are "almost there", just not passed through
> - which would help here too to easily keep just the last value)

Yes, at least it is required for the REPACK case.

But....

Antonin already has a prototype patch to enable logical decoding
for all kinds of tables in [0] (done in the scope of REPACK).

So, if we have such mechanics in place, it looks nice (and almost the
same) for both CIC and REPACK:
* in both cases we create a temporary slot to collect incoming tuples
* in both cases we scan the table, resetting the snapshot every few pages to
keep the xmin horizon advancing
* in both cases we process the already collected part every few megabytes
* just the logic of using the collected tuples is different...
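
To make the snapshot-resetting point concrete, here is a toy Python model (not PostgreSQL code; `held_xmin_over_scan` and all of its parameters are invented for this sketch). It assumes other sessions consume a fixed number of XIDs while each page is scanned, and it records which xmin the scanning backend holds back at each page. With one snapshot for the whole scan the held xmin never moves; with a fresh snapshot every few pages it follows the current XID.

```python
def held_xmin_over_scan(num_pages, pages_per_snapshot,
                        xids_per_page=10, start_xid=1000):
    """Return the xmin held by the scanner while it scans each page."""
    held = []
    current_xid = start_xid
    snapshot_xmin = current_xid
    for page in range(num_pages):
        if page % pages_per_snapshot == 0:
            snapshot_xmin = current_xid  # stand-in for taking a fresh snapshot
        held.append(snapshot_xmin)
        current_xid += xids_per_page  # concurrent sessions keep consuming XIDs
    return held


# One snapshot for the whole scan versus a new one every two pages.
single = held_xmin_over_scan(num_pages=6, pages_per_snapshot=6)
periodic = held_xmin_over_scan(num_pages=6, pages_per_snapshot=2)
```

The gap between the two lists grows with the length of the scan, which is the cost a long-lived single snapshot imposes on the xmin horizon in this model.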

So, yes, in terms of efficiency STIR seems to be better, but such
a common approach like LD looks tempting to have for both REPACK/CIC.

On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
<[email protected]> wrote:
> -1: Requiring the logical decoding system just to reindex an index
>  without O(tablesize) lock time adds too much overhead,

How big is the additional cost of maintaining logical decoding for a
table? Could you please elaborate a little bit?

Best regards,
Mikhail.


[0]: https://www.postgresql.org/message-id/152010.1751307725%40localhost
(v15-0007-Enable-logical-decoding-transiently-only-for-REPACK-.patch)






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Hannu Krosing @ 2025-11-28 21:01 UTC (permalink / raw)
  To: Mihail Nikalayeu <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>; Antonin Houska <[email protected]>

On Fri, Nov 28, 2025 at 9:42 PM Mihail Nikalayeu
<[email protected]> wrote:
>
> Hi, Hannu!
>
> I think you pressed "Reply" instead of "Reply All" - so, I put it to
> the list (looks like nothing is secret here).
> Mostly it is because of my opinion at the end of the mail which I want
> to share with the list.

Thanks, and yes, it was meant for the list.

> On Fri, Nov 28, 2025 at 8:33 PM Hannu Krosing <[email protected]> wrote:
> > If it is an *index AM* then this may not solve HOT chains issue (see
> > below), if we put it on top of *table AM* as some kind of pass-through
> > collector then likely yes, though you may still want to do final sort
> > in commit order to know which one is the latest version of updated
> > tuples which needs to go in the index. The latter is not strictly
> > needed, but would be a nice optimisation for oft-updated rows.
>
> It is an AM which is added as an index (with the same
> columns/expressions/predicates) to the table before phase 1 starts.
> So, all new tuples are inserted into it.
>
> > And I would not collect just the TID, but also the indexed value, as otherwise
> > we end up accessing the table in some random order to get the
> > value (and possibly do visibility checks)
> Just TIDs - they are ordered at the validation phase (while merging with the
> main index) and read using AIO - pretty fast.

It is a space vs work compromise - you either collect it at once or
have to read it again later. Even pretty fast is still slower than
doing nothing :)

> > I am not sure where we decide that a tuple is HOT-updatable, but I
> > suspect that it is before we call any index AMs, so STIR is not
> > guaranteed to solve the issues with HOT chains.
>
> I am not sure what the HOT-chains issue is, but it actually works
> correctly already, including stress tests.
> It is even merged into one commercial fork of PG (I am not affiliated
> with it in any way).

It was about a simplistic approach where VACUUM just ignores the CIC
backends, which could then lead to missing some inserts.

> > (And yes, I have a patch in the works to include old and new tids as part
> > of logical decoding - they are "almost there", just not passed through
> > - which would help here too to easily keep just the last value)
>
> Yes, at least it is required for the REPACK case.
>
> But....
>
> Antonin already has a prototype patch to enable logical decoding
> for all kinds of tables in [0] (done in the scope of REPACK).
>
> So, if we have such mechanics in place, it looks nice (and almost the
> same) for both CIC and REPACK:
> * in both cases we create a temporary slot to collect incoming tuples
> * in both cases we scan the table, resetting the snapshot every few pages to
> keep the xmin horizon advancing
> * in both cases we process the already collected part every few megabytes
> * just the logic of using the collected tuples is different...
>
> So, yes, in terms of efficiency STIR seems to be better, but such
> a common approach like LD looks tempting to have for both REPACK/CIC.

My reasoning was mainly that using something that already exists, and
must work correctly in any case, is a better long-term strategy than
adding complexity in multiple places.

After looking up when CIC appeared (v8.2) and when logical decoding
came along (v9.4), I am starting to think that CIC probably would have used
LD if it had been available when CIC was added.

> On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> <[email protected]> wrote:
> > -1: Requiring the logical decoding system just to reindex an index
> >  without O(tablesize) lock time adds too much overhead,
>
> How big is the additional cost of maintaining logical decoding for a
> table? Could you please elaborate a little bit?
>
> Best regards,
> Mikhail.
>
>
> [0]: https://www.postgresql.org/message-id/152010.1751307725%40localhost
> (v15-0007-Enable-logical-decoding-transiently-only-for-REPACK-.patch)






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Matthias van de Meent @ 2025-12-02 12:02 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; Álvaro Herrera <[email protected]>; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Fri, 28 Nov 2025 at 20:08, Hannu Krosing <[email protected]> wrote:
>
> On Fri, Nov 28, 2025 at 7:31 PM Matthias van de Meent
> <[email protected]> wrote:
> >
> > On Fri, 28 Nov 2025 at 18:58, Hannu Krosing <[email protected]> wrote:
> > >
> > > On Fri, Nov 28, 2025 at 5:58 PM Matthias van de Meent
> > > <[email protected]> wrote:
> > > >
> > > ...
> > > > I'm a bit worried, though, that LR may lose updates due to commit
> > > > order differences between WAL and PGPROC. I don't know how that's
> > > > handled in logical decoding, and can't find much literature about it
> > > > in the repo either.
> > >
> > > > Now the reference to logical decoding made me think that maybe the real
> > > > fix for CIC would be to leverage logical decoding for the 2nd pass of
> > > > CIC and not worry about in-page visibilities at all.
> >
> > -1: Requiring the logical decoding system just to reindex an index
> > without O(tablesize) lock time adds too much overhead, and removes
> > features we currently have (CIC on unlogged tables). wal_level=logical
> > *must not* be required for these tasks if we can at all avoid it.
> > I'm also not sure whether logical decoding gets access to the HOT
> > information of the updated tuples involved, and therefore whether the
> > index build can determine whether it must or can't insert the tuple.
>
> There are more and more cases (not just CIC here) where using logical
> decoding would be the most efficient solution, so why not start
> improving it instead of complicating the system in various places?

Because Logical Replication implies Replication, which in turn implies
(more) WAL generation. And if an unlogged table still generates WAL in
DML, then it's not really an unlogged table, in which case we've
broken a promise to the user [see: CREATE TABLE's UNLOGGED
description]. Adding features to WAL which replicas can't (mustn't!)
do anything with is always going to be bloat in my view.

I also don't know how you measure efficiency, but I don't consider LR
to be particularly efficient in any metric, apart from maybe "wasting
DBA time with abandoned slots". LR parses WAL, which is a conveyor
belt with _all_ changes, and given that WAL has no real upper boundary
on how large it can grow, LR would have to touch an unbounded amount
of data to get only the changes it needs. We already have ways to get
those changes without parsing an unbounded amount of data, so why not
use that instead?

> We could even start selectively logging UNLOGGED and TEMP tables when
> we start CIC if CIC has enough upsides.

Which is why I hate this idea. There can't be enough upsides to
counteract the enormous downside of increasing the size of the data we
need to ship to replicas when the replicas can't ever use that data.
Replicas were able to use the added data of LR before 17 when they
were promoted, so it wasn't terrible to include more data in the WAL,
but what's proposed here is to add data that literally nobody on the
replica can use; wasting WAL storage and replication bandwidth.

Lastly, LR requires replication slots, which are very expensive to
maintain. Currently, you can do CIC/RIC with any number of backends
you want up to max_backends, but this doesn't work if you'd want to
use LR, as you'd now need to have max_replication_slots proportional
to max_connections.

Again, -1 on LR for UNLOGGED/TEMP tables, or on LR in general when the
user explicitly asked for `wal_level NOT IN ('logical')`.

> > I don't think logical decoding is sufficient, because we don't know
> > which tuples were already inserted into the index by their own
> > backends, so we don't know which tuples' index entries we must skip.
>
> The premise of pass2 in CIC is that we collect all the rows that were
> inserted after CIC started for which we are not 100% sure that they
> are inserted in the index. We can only be sure they are inserted for
> transactions started after pass1 completed and the index became
> visible and available for inserts.

I'm not sure this is true; wouldn't it be possible for a transaction
to start before the index became visible, but because of READ
COMMITTED get access to the index after one statement? I.e. two
statements that straddle the index becoming visible? That way, a
transaction could start to see the index after it first modified some
tuples; creating a hybrid visibility state.
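
A toy sketch of the scenario asked about above may help. It is only a model of the question, not a claim that PostgreSQL permits this interleaving; `Catalog`, `run_statement`, and the TID strings are hypothetical names made up for illustration. Each statement decides whether to maintain the index based on the catalog state at its own start, so one transaction ends up with one indexed and one unindexed tuple.

```python
# Toy model of the question above. All names are hypothetical; this does not
# claim that PostgreSQL actually allows the interleaving shown here.
class Catalog:
    def __init__(self):
        self.index_ready_for_inserts = False


def run_statement(catalog, heap, index, tid):
    """One READ COMMITTED statement: it uses the catalog state seen at its start."""
    sees_index = catalog.index_ready_for_inserts
    heap.append(tid)
    if sees_index:
        index.add(tid)


catalog, heap, index = Catalog(), [], set()

run_statement(catalog, heap, index, "tid-1")  # statement 1 of the transaction
catalog.index_ready_for_inserts = True        # the new index becomes visible
run_statement(catalog, heap, index, "tid-2")  # statement 2, same transaction

# Tuples written by this one transaction that have no index entry.
missing = [tid for tid in heap if tid not in index]
print(missing)
```

In this model a per-transaction rule ("started before or after the index became visible") cannot classify the transaction, because its two statements fall on different sides of the boundary.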

> And we do not care about HOT update chains during normal CREATE INDEX
> or the first pass of CIC - we just index what is visible NOW with no regard
> to whether the tuple is at the end of a HOT update chain.

We do care about HOT update chains, because the TID of the HOT root is
indexed, and not necessarily the TID of the scanned tuple.
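
As a rough illustration of that point, the sketch below maps every tuple in a HOT chain on a single page to the chain's root, and uses the root's offset when forming the TID to index. This is a toy model under stated assumptions: `root_offsets`, `next_in_chain`, `heap_only`, and the block number 7 are hypothetical and do not mirror PostgreSQL's page structures.

```python
# Toy model of one heap page. All names are hypothetical, not PostgreSQL code.
# `next_in_chain` maps a tuple's offset to the offset of its HOT successor.
def root_offsets(offsets, next_in_chain, heap_only):
    """Map every tuple offset to the offset of the root of its HOT chain."""
    roots = {}
    for start in offsets:
        if start in heap_only:
            continue  # heap-only tuples are never chain roots
        offset = start
        while offset is not None:
            roots[offset] = start
            offset = next_in_chain.get(offset)
    return roots


# Offsets 1 -> 2 -> 3 form one HOT chain; offset 4 is a plain tuple.
roots = root_offsets(
    offsets=[1, 2, 3, 4],
    next_in_chain={1: 2, 2: 3},
    heap_only={2, 3},
)

block = 7            # hypothetical block number
visible_offset = 3   # the tuple version the scan actually sees
index_tid = (block, roots[visible_offset])
print(roots, index_tid)
```

The scan reads the values from the tuple at offset 3, but the index entry points at offset 1, which is why the build has to know where each chain starts.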

> > PS. I think the same should be true for REPACK CONCURRENTLY, but
> > that's a new command with yet-to-be-determined semantics, unlike CIC
> > which has been part of PG for 6 years.
>
> CIC has been around way longer, since 8.2 released in 2006, so more
> like 20 years :)

Ah, so RIC wasn't introduced together with CIC? TIL.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-12-01 09:09                                                   ` Antonin Houska <[email protected]>
  2025-12-02 10:51                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  1 sibling, 1 reply; 64+ messages in thread

From: Antonin Houska @ 2025-12-01 09:09 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Matthias van de Meent <[email protected]> wrote:

> I'm a bit worried, though, that LR may lose updates due to commit
> order differences between WAL and PGPROC. I don't know how that's
> handled in logical decoding, and can't find much literature about it
> in the repo either.

Can you please give me an example of this problem? I understand that two
transactions do this

T1: RecordTransactionCommit()
T2: RecordTransactionCommit()
T2: ProcArrayEndTransaction()
T1: ProcArrayEndTransaction()

but I'm failing to imagine this if both transactions are trying to update the
same row. For example, if T1 is updating a row that T2 wants to update as
well, then T2 has to wait for T1's call of ProcArrayEndTransaction() before it
can perform its update, and therefore it (T2) cannot start its commit sequence
before T1 has completed it:

T1: RecordTransactionCommit()
T1: ProcArrayEndTransaction()
T2: RecordTransactionCommit()
T2: ProcArrayEndTransaction()

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-12-02 10:51                                                     ` Matthias van de Meent <[email protected]>
  2025-12-04 08:34                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-12-02 10:51 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Mon, 1 Dec 2025 at 10:09, Antonin Houska <[email protected]> wrote:
>
> Matthias van de Meent <[email protected]> wrote:
>
> > I'm a bit worried, though, that LR may lose updates due to commit
> > order differences between WAL and PGPROC. I don't know how that's
> > handled in logical decoding, and can't find much literature about it
> > in the repo either.
>
> Can you please give me an example of this problem? I understand that two
> transactions do this
>
> T1: RecordTransactionCommit()
> T2: RecordTransactionCommit()
> T2: ProcArrayEndTransaction()
> T1: ProcArrayEndTransaction()
>
> but I'm failing to imagine this if both transactions are trying to update the
> same row.

Correct, it doesn't have anything to do with two transactions updating
the same row; but instead the same transaction getting applied twice;
related to issues described in (among others) [0]:
Logical replication applies transactions in WAL commit order, but
(normal) snapshots on the primary use the transaction's persistence
requirements (and procarray lock acquisition) as commit order.

This can cause the snapshot to see T2 as committed before T1, whilst
logical replication will apply transactions in T1 -> T2 order. This
can break the exactly-once expectations of commits, because a normal
snapshot taken between T2 and T1 on the primary (i.e., T2 is
considered committed, but T1 not) will have T2 already applied. LR
would have to apply changes of T1, which also implies it'd eventually
get to T2's commit and apply that too. Alternatively, it'd skip past
T2 because that's already present in the snapshot, and lose the
changes that were committed with T1.

I can't think of an ordering that applies all changes correctly
without either filtering which transactions to include in LR apply
steps, or LR's sync scan snapshots being different from normal
snapshots on the primary.
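
To make the two failure modes concrete, here is a toy sketch. It is not PostgreSQL code: `DELTAS`, `pgproc_snapshot`, and `copy_then_apply` are hypothetical names, and each transaction is reduced to adding a number to one counter row. It assumes the interleaving quoted above, where T1 writes its commit record first but T2 leaves the procarray first.

```python
# Toy model, not PostgreSQL code: every name below is hypothetical.
# Each transaction adds a delta to a single counter row.
DELTAS = {"T1": 10, "T2": 5}
WAL_COMMIT_ORDER = ["T1", "T2"]     # order of RecordTransactionCommit()
PROCARRAY_END_ORDER = ["T2", "T1"]  # order of ProcArrayEndTransaction()


def pgproc_snapshot(ends_seen):
    """Transactions a PGPROC-based snapshot treats as committed once
    `ends_seen` calls to ProcArrayEndTransaction() have completed."""
    return set(PROCARRAY_END_ORDER[:ends_seen])


def copy_then_apply(visible, first_commit, skip_visible=False):
    """Copy the row as seen by `visible`, then apply decoded commits in WAL
    order, starting at index `first_commit`."""
    value = sum(DELTAS[xid] for xid in visible)
    for xid in WAL_COMMIT_ORDER[first_commit:]:
        if skip_visible and xid in visible:
            continue  # already contained in the initial copy
        value += DELTAS[xid]
    return value


expected = sum(DELTAS.values())
snap = pgproc_snapshot(ends_seen=1)  # sees T2 as committed, but not T1

double_applied = copy_then_apply(snap, first_commit=0)  # T2 applied twice
lost_update = copy_then_apply(snap, first_commit=2)     # T1 never applied
filtered = copy_then_apply(snap, first_commit=0, skip_visible=True)

print(expected, double_applied, lost_update, filtered)
```

In this model only the filtered variant matches the expected value. Starting the apply at T1 counts T2 twice, and starting after T2 drops T1.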


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

[0] https://jepsen.io/analyses/amazon-rds-for-postgresql-17.4

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-12-04 08:34                                                       ` Antonin Houska <[email protected]>
  2025-12-04 16:32                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Antonin Houska @ 2025-12-04 08:34 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Matthias van de Meent <[email protected]> wrote:

> On Mon, 1 Dec 2025 at 10:09, Antonin Houska <[email protected]> wrote:
> >
> > Matthias van de Meent <[email protected]> wrote:
> >
> > > I'm a bit worried, though, that LR may lose updates due to commit
> > > order differences between WAL and PGPROC. I don't know how that's
> > > handled in logical decoding, and can't find much literature about it
> > > in the repo either.
> >
> > Can you please give me an example of this problem? I understand that two
> > transactions do this
> >
> > T1: RecordTransactionCommit()
> > T2: RecordTransactionCommit()
> > T2: ProcArrayEndTransaction()
> > T1: ProcArrayEndTransaction()
> >
> > but I'm failing to imagine this if both transactions are trying to update the
> > same row.
> 
> Correct, it doesn't have anything to do with two transactions updating
> the same row; but instead the same transaction getting applied twice;
> related to issues described in (among others) [0]:
> Logical replication applies transactions in WAL commit order, but
> (normal) snapshots on the primary use the transaction's persistence
> requirements (and procarray lock acquisition) as commit order.
> 
> This can cause the snapshot to see T2 as committed before T1, whilst
> logical replication will apply transactions in T1 -> T2 order. This
> can break the exactly-once expectations of commits, because a normal
> snapshot taken between T2 and T1 on the primary (i.e., T2 is
> considered committed, but T1 not) will have T2 already applied. LR
> would have to apply changes of T1, which also implies it'd eventually
> get to T2's commit and apply that too. Alternatively, it'd skip past
> T2 because that's already present in the snapshot, and lose the
> changes that were committed with T1.

ISTM that what you consider a problem is copying the table using PGPROC-based
snapshot and applying logically decoded commits to the result - is that what
you mean?

In fact, LR (and also REPACK) uses snapshots generated by the logical decoding
system. The information on running/committed transactions is based here on
replaying WAL, not on PGPROC. Thus if the snapshot sees T2 already applied, it
means that T2's COMMIT record was already decoded, and therefore no data
change of that transaction should be passed to the output plugin (and
consequently applied to the new table).
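
The idea can be sketched with a toy model in which both the snapshot and the forwarded changes are derived from the same WAL position. This is an illustration only, not the snapshot builder's actual logic: `WAL_COMMITS`, `decoding_snapshot`, and `forwarded_changes` are hypothetical names, and a list index stands in for an LSN.

```python
# Toy model, not PostgreSQL code: all names here are hypothetical.
# The WAL is a list of commit records; a record's index stands in for its LSN.
WAL_COMMITS = ["T1", "T2", "T3"]
DELTAS = {"T1": 10, "T2": 5, "T3": 1}


def decoding_snapshot(snapshot_lsn):
    """Transactions whose COMMIT record was decoded before `snapshot_lsn`."""
    return set(WAL_COMMITS[:snapshot_lsn])


def forwarded_changes(snapshot_lsn):
    """Transactions passed on for apply: commits at or after `snapshot_lsn`."""
    return WAL_COMMITS[snapshot_lsn:]


def copy_then_apply(snapshot_lsn):
    """Copy under the WAL-derived snapshot, then apply the forwarded commits."""
    value = sum(DELTAS[xid] for xid in decoding_snapshot(snapshot_lsn))
    for xid in forwarded_changes(snapshot_lsn):
        value += DELTAS[xid]
    return value


# Whatever position the snapshot is taken at, each transaction is counted once.
results = [copy_then_apply(lsn) for lsn in range(len(WAL_COMMITS) + 1)]
print(results)
```

Because one WAL position splits the commits into "in the snapshot" and "still to be forwarded", no transaction lands in both sets or in neither, regardless of the order in which backends left the procarray.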

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-12-04 16:32                                                         ` Matthias van de Meent <[email protected]>
  2025-12-04 19:15                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-12-04 16:32 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Thu, 4 Dec 2025 at 09:34, Antonin Houska <[email protected]> wrote:
>
> Matthias van de Meent <[email protected]> wrote:
>
> > On Mon, 1 Dec 2025 at 10:09, Antonin Houska <[email protected]> wrote:
> > >
> > > Matthias van de Meent <[email protected]> wrote:
> > >
> > > > I'm a bit worried, though, that LR may lose updates due to commit
> > > > order differences between WAL and PGPROC. I don't know how that's
> > > > handled in logical decoding, and can't find much literature about it
> > > > in the repo either.
> > >
> > > Can you please give me an example of this problem? I understand that two
> > > transactions do this
> > >
> > > T1: RecordTransactionCommit()
> > > T2: RecordTransactionCommit()
> > > T2: ProcArrayEndTransaction()
> > > T1: ProcArrayEndTransaction()
> > >
> > > but I'm failing to imagine this if both transactions are trying to update the
> > > same row.
> >
> > Correct, it doesn't have anything to do with two transactions updating
> > the same row; but instead the same transaction getting applied twice;
> > related to issues described in (among others) [0]:
> > Logical replication applies transactions in WAL commit order, but
> > (normal) snapshots on the primary use the transaction's persistence
> > requirements (and procarray lock acquisition) as commit order.
> >
> > This can cause the snapshot to see T2 as committed before T1, whilst
> > logical replication will apply transactions in T1 -> T2 order. This
> > can break the exactly-once expectations of commits, because a normal
> > snapshot taken between T2 and T1 on the primary (i.e., T2 is
> > considered committed, but T1 not) will have T2 already applied. LR
> > would have to apply changes of T1, which also implies it'd eventually
> > get to T2's commit and apply that too. Alternatively, it'd skip past
> > T2 because that's already present in the snapshot, and lose the
> > changes that were committed with T1.
>
> ISTM that what you consider a problem is copying the table using PGPROC-based
> snapshot and applying logically decoded commits to the result - is that what
> you mean?

Correct.

> In fact, LR (and also REPACK) uses snapshots generated by the logical decoding
> system. The information on running/committed transactions is based here on
> replaying WAL, not on PGPROC.

OK, that's good to know. For reference, do you know where this is
documented, explained, or implemented?

I'm asking because the code that I could find didn't seem to use any
special snapshot (tablesync.c uses
`PushActiveSnapshot(GetTransactionSnapshot())`), and the other
reference to LR's snapshots (snapbuild.c, and inside
`GetTransactionSnapshot()`) explicitly said that its snapshots are
only to be used for catalog lookups, never for general-purpose
queries.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Antonin Houska @ 2025-12-04 19:15 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; pgsql-hackers; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Matthias van de Meent <[email protected]> wrote:

> On Thu, 4 Dec 2025 at 09:34, Antonin Houska <[email protected]> wrote:
> >
> > ISTM that what you consider a problem is copying the table using PGPROC-based
> > snapshot and applying logically decoded commits to the result - is that what
> > you mean?
> 
> Correct.
> 
> > In fact, LR (and also REPACK) uses snapshots generated by the logical decoding
> > system. The information on running/committed transactions is based here on
> > replaying WAL, not on PGPROC.
> 
> OK, that's good to know. For reference, do you know where this is
> documented, explained, or implemented?

All my knowledge of these things is from source code.

> I'm asking, because the code that I could find didn't seem to use any
> special snapshot (tablesync.c uses
> `PushActiveSnapshot(GetTransactionSnapshot())`),

My understanding is that this is what happens on the subscription side. A few
lines above that, however, walrcv_create_slot(..., CRS_USE_SNAPSHOT, ...) is
called, which in turn calls CreateReplicationSlot(..., CRS_USE_SNAPSHOT, ...)
on the publication side, and that sets the snapshot for the transaction on the
remote (publication) side:

	else if (snapshot_action == CRS_USE_SNAPSHOT)
	{
		Snapshot	snap;

		snap = SnapBuildInitialSnapshot(ctx->snapshot_builder);
		RestoreTransactionSnapshot(snap, MyProc);
	}

> and the other
> reference to LR's snapshots (snapbuild.c, and inside
> `GetTransactionSnapshot()`) explicitly said that its snapshots are
> only to be used for catalog lookups, never for general-purpose
> queries.

I think the reason is that snapbuild.c only maintains snapshots for catalog
scans, because in logical decoding you only need to scan catalog tables. This
is especially to find out which tuple descriptor was valid when a particular
data change (INSERT / UPDATE / DELETE) was WAL-logged - the output plugin
needs the correct version of the tuple descriptor to deform each tuple.
However, there is no need to scan non-catalog tables: as long as
wal_level=logical, the WAL records contain all the information needed for
logical replication (including key values). So snapbuild.c only keeps track of
transactions that modify the system catalog and uses this information to
create the snapshots.

A special case is if you pass need_full_snapshot=true to
CreateInitDecodingContext(). In this case the snapshot builder tracks commits
of all transactions, but only does so until SNAPBUILD_CONSISTENT state is
reached. Thus, just before the actual decoding starts, you can get a snapshot
to scan even non-catalog tables (SnapBuildInitialSnapshot() creates that, like
in the code above). (For REPACK, I'm trying to teach snapbuild.c to recognize
that a transaction changed one particular non-catalog table, so it can build
snapshots to scan this one table anytime.)

Another reason not to use those snapshots for non-catalog tables is that
snapbuild.c creates snapshots of the kind SNAPSHOT_HISTORIC_MVCC. If you used
this for non-catalog tables, HeapTupleSatisfiesHistoricMVCC() would be used
for visibility checks instead of HeapTupleSatisfiesMVCC(). The latter can
handle tuples surviving from an older version of PostgreSQL, but the former
cannot:

	/* Used by pre-9.0 binary upgrades */
	if (tuple->t_infomask & HEAP_MOVED_OFF)

No such tuples should appear in the catalog because initdb always creates it
from scratch.

For LR, SnapBuildInitialSnapshot() takes care of the conversion from
SNAPSHOT_HISTORIC_MVCC to SNAPSHOT_MVCC.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Hannu Krosing @ 2025-12-04 21:03 UTC (permalink / raw)
  To: pgsql-hackers; +Cc: Antonin Houska <[email protected]>; Matthias van de Meent <[email protected]>; Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

I just sent a small patch for logical decoding to pgsql-hackers@
exposing to logical decoding old and new tuple ids and a boolean
telling if an UPDATE is HOT.

Feel free to test if this helps here as well







* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements

From: Heikki Linnakangas @ 2025-12-16 13:43 UTC (permalink / raw)
  To: Hannu Krosing <[email protected]>; pgsql-hackers; +Cc: Antonin Houska <[email protected]>; Matthias van de Meent <[email protected]>; Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Didn't know which part of this thread to quote and reply to, so I'll 
comment on the whole thing. This is a mix of a summary of the ideas 
already discussed, and a new proposal.

Firstly, I think the STIR approach is the right approach at the high 
level. I don't like the logical decoding idea, for the reasons Matthias 
and Mikhail already mentioned. Maybe there's some synergy with REPACK, 
but it feels different enough that I doubt it. Let's focus on the STIR 
approach.

Summary of CIC as it is today
-----------------------------

To recap, the CIC approach at a very high level is:

1. Build the index, while backends are modifying the table concurrently

2. Retail insert all the tuples that we missed in step 1.

A lot of logic and coordination goes into determining what was missed in 
step 1. Currently, it involves snapshots, waiting for concurrent 
transactions to finish, and re-scanning the index and the table.

The STIR idea is to maintain a little data structure on the side where 
we collect items that are inserted between steps 1 and 2, to avoid 
re-scanning the table.
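
To make the side structure concrete, here is a minimal sketch of an 
append-only TID log in plain C. It is not the STIR access method from the 
patch set: the names Tid, TidLog, tidlog_* and count_insert are all 
hypothetical, the callback merely stands in for a retail index insert, 
and the shared storage and locking a real implementation would need are 
left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for a heap tuple identifier (block + line pointer). */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Hypothetical append-only log of TIDs inserted while the bulk build runs. */
typedef struct TidLog
{
	Tid		   *items;
	size_t		count;
	size_t		capacity;
} TidLog;

static void
tidlog_init(TidLog *log)
{
	log->items = NULL;
	log->count = 0;
	log->capacity = 0;
}

/* Append one TID; returns 0 on success, -1 if memory could not be allocated. */
static int
tidlog_append(TidLog *log, Tid tid)
{
	if (log->count == log->capacity)
	{
		size_t		newcap = log->capacity ? log->capacity * 2 : 16;
		Tid		   *p = realloc(log->items, newcap * sizeof(Tid));

		if (p == NULL)
			return -1;
		log->items = p;
		log->capacity = newcap;
	}
	log->items[log->count++] = tid;
	return 0;
}

/*
 * Hand every logged TID to insert_fn (the stand-in for a retail index
 * insert), then empty the log.  Returns the number of TIDs processed.
 */
static size_t
tidlog_drain(TidLog *log, void (*insert_fn) (Tid tid, void *arg), void *arg)
{
	size_t		n = log->count;

	assert(insert_fn != NULL);
	for (size_t i = 0; i < n; i++)
		insert_fn(log->items[i], arg);
	log->count = 0;
	return n;
}

static void
tidlog_free(TidLog *log)
{
	free(log->items);
	tidlog_init(log);
}

/* Example callback: only counts the TIDs it is given. */
static void
count_insert(Tid tid, void *arg)
{
	(void) tid;
	(*(size_t *) arg)++;
}
```

Backends would call tidlog_append() for each new heap tuple during the 
bulk build, and the index builder would call tidlog_drain() once at the 
end, which is what replaces the second table scan.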


Shmem struct
------------

One high-level observation:

We're using the catalog for inter-process communication, with the 
indisready and indisvalid flags, and now with STIR by having a special, 
ephemeral index AM. That feels unnecessarily difficult. I propose that 
we introduce a little shared memory struct to keep track of in-progress 
CONCURRENTLY index builds.

In the first transaction that inserts the catalog entry with 
indisready=false, also create a shmem struct. In that struct, we can 
store information about what state the build is in, and whether 
insertions should go to the STIR or to the real index.
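
As a rough illustration only, such a struct might look something like 
the sketch below. Every name in it (CicPhase, CicInsertTarget, 
CicSharedState, cic_state_*) is made up for this example, Oid is 
replaced by a plain typedef, and the locking that a real shared-memory 
entry would need is omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;			/* stand-in for PostgreSQL's Oid */

/* Hypothetical phases of one CONCURRENTLY build. */
typedef enum CicPhase
{
	CIC_PHASE_REGISTERED,		/* catalog entry exists, indisready = false */
	CIC_PHASE_BUILDING,			/* bulk build in progress */
	CIC_PHASE_CATCHUP,			/* retail-inserting what the build missed */
	CIC_PHASE_DONE
} CicPhase;

/* Where a concurrent backend should send a new tuple for this index. */
typedef enum CicInsertTarget
{
	CIC_INSERT_NOWHERE,
	CIC_INSERT_STIR,
	CIC_INSERT_INDEX
} CicInsertTarget;

/* One entry per in-progress build; would live in shared memory. */
typedef struct CicSharedState
{
	Oid			indexoid;
	CicPhase	phase;
	CicInsertTarget target;
} CicSharedState;

/* Called by the transaction that inserts the catalog entry. */
static void
cic_state_init(CicSharedState *st, Oid indexoid)
{
	st->indexoid = indexoid;
	st->phase = CIC_PHASE_REGISTERED;
	st->target = CIC_INSERT_NOWHERE;
}

/* Move the build forward; phases are assumed never to go backwards. */
static void
cic_state_set(CicSharedState *st, CicPhase phase, CicInsertTarget target)
{
	assert(phase >= st->phase);
	st->phase = phase;
	st->target = target;
}
```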

Avoid one wait-for-all-transactions step using the shmem struct
---------------------------------------------------------------

As one small incremental improvement, we could use the shmem struct to 
avoid one of the "wait for all transactions" steps in the current 
implementation. In validate_index(), after we mark the index as 
'indisready' we have to wait for all transactions to finish, to ensure 
that all subsequent insertions have seen the indisready=true change. We 
could avoid that by setting a flag in the shmem struct instead, so that 
all backends would see instantly that the flag is flipped.
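
A minimal sketch of such a flag, assuming C11 <stdatomic.h> purely to 
keep the example self-contained (PostgreSQL has its own atomics API in 
port/atomics.h). CicReadyFlag and the cic_flag_* functions are 
hypothetical names, and the sketch says nothing about backends that are 
already in the middle of an insertion when the flag flips.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared-memory flag replacing the wait after indisready. */
typedef struct CicReadyFlag
{
	atomic_bool index_ready;
} CicReadyFlag;

static void
cic_flag_init(CicReadyFlag *f)
{
	assert(f != NULL);
	atomic_init(&f->index_ready, false);
}

/* Index builder: publish that insertions must now maintain the index. */
static void
cic_flag_mark_ready(CicReadyFlag *f)
{
	assert(f != NULL);
	atomic_store_explicit(&f->index_ready, true, memory_order_release);
}

/* Any backend: check the flag before deciding whether to insert. */
static bool
cic_flag_is_ready(CicReadyFlag *f)
{
	assert(f != NULL);
	return atomic_load_explicit(&f->index_ready, memory_order_acquire);
}
```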


Improved STIR approach
----------------------

Here's another proposal using the STIR approach. It's a little different 
from the patches so far:

- Instead of having an ephemeral index AM, I'm imagining that 
index_insert() has access to the shmem struct, and knows about the STIR 
and can redirect insertions to it.

- I want to avoid re-scanning the index as well as the heap. To 
accomplish that, track more precisely which tuples are already in the 
index and which are not, by storing XID cutoffs in the shmem struct.


The proposal:

1. Insert the catalog entry with indisvalid = false and indisready = 
false. Commit the transaction.

2. Wait for all transactions to finish.
   - Now we know that all subsequently-started transactions will see the 
index and will take it into account when deciding HOT chains. (No 
changes to current implementation so far)
   - All subsequently-started transactions will now also check the shmem 
struct for the status of the index build, in index_insert(). We'll use 
the shmem struct to coordinate the later steps.

3. Atomically do the following:
3.1. Take snapshot A
3.2. Store the snapshot's xmax in the shmem struct where all concurrent 
backends can see it. Let's call this "cutoff A".

After this step, whenever a backend inserts a new tuple, it will append 
its TID to the STIR if the transaction's XID >= cutoff A. (No insertions 
to the actual index yet)

4. Build the index using snapshot A. It will include all tuples visible 
or in-progress according to the snapshot.

5. Atomically do the following:
5.1. Take snapshot B
5.2. Store the snapshot's xmax in the shmem struct. We'll call this 
cutoff B.

From now on, backends insert all tuples >= cutoff B directly to the 
index. Tuples between A and B continue to be appended to the STIR.

6. Wait for all transactions < B to finish.

At this stage:
- All tuples < A are in the index. They were included in the bulk ambuild.
- All tuples between A and B are in the STIR.
- All tuples >= B are inserted to the index by the backends


7. Retail insert all the tuples from the STIR to the index.
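
The routing rule that steps 3 and 5 impose on backends can be sketched 
as a small pure function. This is only an illustration under some 
assumptions: XIDs are modelled as 64-bit integers so that plain 
comparisons work (real code would need wraparound-aware comparisons such 
as TransactionIdPrecedes()), 0 means "cutoff not set yet", and the names 
Xid64, CicCutoffs, CicRoute and cic_route_insert are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Xid64;			/* assumption: 64-bit XIDs, no wraparound */
#define XID_UNSET ((Xid64) 0)	/* assumption: 0 means "not set yet" */

typedef enum CicRoute
{
	CIC_ROUTE_NONE,				/* covered by the bulk build, or no build yet */
	CIC_ROUTE_STIR,				/* append the TID to the STIR */
	CIC_ROUTE_INDEX				/* insert directly into the index */
} CicRoute;

/* The two cutoffs published in the shmem struct (steps 3.2 and 5.2). */
typedef struct CicCutoffs
{
	Xid64		cutoff_a;
	Xid64		cutoff_b;
} CicCutoffs;

/* Decide what a backend running as 'xid' does with a newly inserted tuple. */
static CicRoute
cic_route_insert(const CicCutoffs *c, Xid64 xid)
{
	assert(xid != XID_UNSET);
	/* cutoff B is only ever published after cutoff A, and is not older */
	assert(c->cutoff_b == XID_UNSET ||
		   (c->cutoff_a != XID_UNSET && c->cutoff_a <= c->cutoff_b));

	if (c->cutoff_a == XID_UNSET || xid < c->cutoff_a)
		return CIC_ROUTE_NONE;
	if (c->cutoff_b != XID_UNSET && xid >= c->cutoff_b)
		return CIC_ROUTE_INDEX;
	return CIC_ROUTE_STIR;
}
```

With cutoff A = 100 and cutoff B = 200, XID 99 does nothing, XID 150 
appends to the STIR, and XID 200 inserts into the index directly, which 
matches the three cases listed under "At this stage".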


Snapshot refreshing
-------------------

The above proposal doesn't directly accomplish the original goal of 
advancing the global xmin horizon. You still need two long-lived 
snapshots. It does however make CIC faster, by eliminating the full 
index scan and table scan in the validate_index() stage. That already 
helps a little.

I believe it can be extended to also advance xmin horizon:

- In step 4, while we are building the index, we can periodically get a 
new snapshot, update the cutoff in the shmem struct, and drain the STIR 
of the tuples that are already in it.

- In step 7, we can take a new snapshot as often as we like. The 
snapshot is only used to evaluate expressions.


- Heikki






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
  2025-05-18 15:09 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-18 15:56   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Álvaro Herrera <[email protected]>
  2025-05-18 16:09     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-05-23 21:59       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 16:17         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-16 20:00           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-16 20:21             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-17 15:55               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 10:49                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 16:33                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-18 21:15                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-06-18 21:36                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-06-21 20:32                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-03 00:23                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-07-07 12:00                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Sergey Sargsyan <[email protected]>
  2025-07-10 14:30                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-05 00:25                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-09-28 09:26                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-10-28 18:37                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-09 18:02                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:40                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-27 18:59                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 20:07                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 14:50                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-28 16:57                                                 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-12-01 09:09                                                   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-02 10:51                                                     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-12-04 08:34                                                       ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-04 16:32                                                         ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-12-04 19:15                                                           ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-12-04 21:03                                                             ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Hannu Krosing <[email protected]>
  2025-12-16 13:43                                                               ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Heikki Linnakangas <[email protected]>
@ 2025-12-16 21:58                                                                 ` Mihail Nikalayeu <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-12-16 21:58 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: Hannu Krosing <[email protected]>; pgsql-hackers; Antonin Houska <[email protected]>; Matthias van de Meent <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

Hello, Heikki!

On Tue, Dec 16, 2025 at 2:43 PM Heikki Linnakangas <[email protected]> wrote:
> Firstly, I think the STIR approach is the right approach at the high
> level. I don't like the logical decoding idea, for the reasons Matthias
> and Mikhail already mentioned. Maybe there's some synergy with REPACK,
> but it feels different enough that I doubt it. Let's focus on the STIR
> approach.

Thanks for checking that thread.

> In the first transaction that inserts the catalog entry with
> indisready=false, also create a shmem struct. In that struct, we can
> store information about what state the build is in, and whether
> insertions should go to the STIR or to the real index.

Yes, it might look simpler, but from another point of view:
* we would need to check that shmem on every index insert (whether or
not a build is in progress)
* or we would need to put an entry into the index list saying "write
into that shmem instead of this index"
* currently we rely on proven mechanics around transactions, catalog
snapshots, relcache, invalidation, etc. Some tricky synchronization may
be required here (to avoid any drift between how a transaction sees the
shmem and how it sees the relcache).

> As one small incremental improvement, we could use the shmem struct to
> avoid one of the "wait for all transactions" steps in the current
> implementation. In validate_index(), after we mark the index as
> 'indisready' we have to wait for all transactions to finish, to ensure
> that all subsequent insertions have seen the indisready=true change. We
> could avoid that by setting a flag in the shmem struct instead, so that
> all backends would see instantly that the flag is flipped.

That may be tricky. If I set a flag, what if someone checked it 1 ns
ago and decided that it does not need to write anything into the index?
How do we ensure that everyone really knows about the change without
heavy locking?
In all current maintenance operations we ensure in some way (by
locking/unlocking a relation or by waiting for transactions) that
everyone has a fresh enough relcache. I don't think we should introduce
anything special for the CIC scenario here.

But some universal solution (like ensuring that every other
transaction that had an outdated relcache is ended) may benefit all
related scenarios.
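
To illustrate the concern about the flag, here is a toy interleaving in plain Python. It is not PostgreSQL code: `SharedState` and `index_ready` are made-up names, and the "backends" are just statements executed in a fixed order. It only shows that a check made just before the flip can lead to a heap insert that never reaches the index.

```python
# Toy model only; SharedState and index_ready are hypothetical names,
# not PostgreSQL structures.
class SharedState:
    def __init__(self):
        self.index_ready = False  # stands in for a flag in the shmem struct


def tids_missing_from_index():
    shared = SharedState()
    heap, index = set(), set()

    # Backend B begins an insert and checks the flag just before the flip.
    b_saw_ready = shared.index_ready

    # Builder A flips the flag; nothing here waits for backend B.
    shared.index_ready = True

    # Backend B finishes its insert based on the stale check.
    tid = (0, 1)
    heap.add(tid)
    if b_saw_ready:
        index.add(tid)

    # Anything in the heap but not in the index was lost by the race.
    return heap - index


print(tids_missing_from_index())
```

Waiting for older transactions (or taking a lock) after the flip is what closes this window in the existing approach; the sketch simply omits that step.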

> Improved STIR approach
>
> Here's another proposal using the STIR approach. It's a little different
> from the patches so far:
> ....
> 7. Retail insert all the tuples from the STIR to the index.

Hm, that's a clever idea...
At the same time, my tests show that the index scan is cheap compared to
the heap scans (especially the second one, which is not parallelized).

> Snapshot refreshing
> -------------------
> - In step 4, while we are building the index, we can periodically get a
> new snapshot, update the cutoff in the shmem struct, and drain the STIR
> of the tuples that are already in it.

But combined with snapshot resetting, such an approach is still more
efficient (in terms of the index scan), yet it feels much more complex,
including some complex locking.
I need to think a little bit more here.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-12-17 18:22                                                                 ` Matthias van de Meent <[email protected]>
  1 sibling, 0 replies; 64+ messages in thread

From: Matthias van de Meent @ 2025-12-17 18:22 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: Hannu Krosing <[email protected]>; pgsql-hackers; Antonin Houska <[email protected]>; Mihail Nikalayeu <[email protected]>; Sergey Sargsyan <[email protected]>; [email protected]; Andres Freund <[email protected]>; Michael Paquier <[email protected]>; Andrey Borodin <[email protected]>; Melanie Plageman <[email protected]>

On Tue, 16 Dec 2025 at 14:43, Heikki Linnakangas <[email protected]> wrote:
>
> Summary of CIC as it is today
> -----------------------------
>
> To recap, the CIC approach at very high level is:
>
> 1. Build the index, while backends are modifying the table concurrently
>
> 2. Retail insert all the tuples that we missed in step 1.
>
> A lot of logic and coordination goes into determining what was missed in
> step 1. Currently, it involves snapshots, waiting for concurrent
> transactions to finish, and re-scanning the index and the table.
>
> The STIR idea is to maintain a little data structure on the side where
> we collect items that are inserted between steps 1 and 2, to avoid
> re-scanning the table.

During step 1, up to step 2, indeed.
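
As a rough illustration of the recap above, the side structure can be modelled as follows. This is a toy sketch in Python, not the patch: `concurrent_build` is a hypothetical name, TIDs are plain `(block, offset)` pairs, and deletes and visibility are ignored entirely.

```python
def concurrent_build(heap_at_snapshot, concurrent_inserts):
    """Toy two-step build; TIDs are (block, offset) pairs."""
    # Step 1: bulk-build from what the build snapshot can see.
    index = set(heap_at_snapshot)

    # Meanwhile, writers record the TIDs they insert in a side structure
    # (the role STIR plays) instead of touching the unfinished index.
    stir = list(concurrent_inserts)

    # Step 2: retail-insert only what step 1 missed.
    for tid in stir:
        if tid not in index:
            index.add(tid)
    return index


print(sorted(concurrent_build([(0, 1), (0, 2)], [(0, 3), (0, 2)])))
```

The point of the side structure is that step 2 only has to look at `stir`, not at the whole table again.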

> Shmem struct
> ------------
>
> One high-level observation:
>
> We're using the catalog for inter-process communication, with the
> indisready and indisvalid flags, and now with STIR by having a special,
> ephemeral index AM. That feels unnecessarily difficult. I propose that
> we introduce a little shared memory struct to keep track of in-progress
> CONCURRENTLY index builds.
>
> In the first transaction that inserts the catalog entry with
> indisready=false, also create a shmem struct. In that struct, we can
> store information about what state the build is in, and whether
> insertions should go to the STIR or to the real index.
>
> Avoid one wait-for-all-transactions step using the shmem struct
> ---------------------------------------------------------------
>
> As one small incremental improvement, we could use the shmem struct to
> avoid one of the "wait for all transactions" steps in the current
> implementation. In validate_index(), after we mark the index as
> 'indisready' we have to wait for all transactions to finish, to ensure
> that all subsequent insertions have seen the indisready=true change. We
> could avoid that by setting a flag in the shmem struct instead, so that
> all backends would see instantly that the flag is flipped.
>
>
> Improved STIR approach
> ----------------------
>
> Here's another proposal using the STIR approach. It's a little different
> from the patches so far:
> [many steps]

I am not convinced that this new approach is correct, as it introduces
too many new moving components into concurrent index creation. I'm
quite concerned about the correctness around snapshot xmax checks:
while the approach is in a different system than that of PG14.0's CIC
changes, I can't help but think that this will open another can of
worms that is similar to that bug; it also changes expectations about
snapshot contents (by conditionally including tuples in the
"visibility" of STIR data structure), and I'm not even sure that it
guarantees that STIR contains all possibly-visible-after-CIC tuples
that aren't visible in the snapshot(s) of the main table scan.

So, let's not complicate these changes more than what they already are.

> Snapshot refreshing
> -------------------
>
> The above proposal doesn't directly accomplish the original goal of
> advancing the global xmin horizon. You still need two long-lived
> snapshots. It does however make CIC faster, by eliminating the full
> index scan and table scan in the validate_index() stage. That already
> helps a little.
>
> I believe it can be extended to also advance xmin horizon:
>
> - In step 4, while we are building the index, we can periodically get a
> new snapshot, update the cutoff in the shmem struct, and drain the STIR
> of the tuples that are already in it.

I'm against periodic draining of STIR:

1. With these periodic index scans, each heap page may be accessed
many more times than the current 2 times, as heap pages may be updated
at least once every time STIR is drained (up to MaxHeapTuplesPerPage);
thus strictly increasing the heap IO requirement compared to only
using STIR at phase 2.
2. The STIR index will contain TIDs of pages we have yet to scan; we'd
have to filter these out if we want to prevent duplicate TID insertion
(or we would risk having the STIR TID visible in an upcoming
visibility snapshot and inserting it twice).
3. The STIR would contain TIDs that may already be dead by the end of
the first phase; scanning the STIR index early could mean we'd still
have added them to the index.
4. The current approach to re-snapshotting happens completely inside
the table AM. Adding STIR draining to this phase would require
completely new code inside tableAMs or inside index AMs. It doesn't
make much sense for a tableAM to know about this.

Altogether, I think those drawbacks of periodically draining STIR
are too significant to consider right now.
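
Point 2 can be made concrete with a toy model. The sketch below is plain Python with hypothetical names (`build_with_early_drain`, `filter_unscanned`); it assumes the snapshot taken after the drain sees the concurrently inserted tuple, and it keeps the "index" as a list so that a double insertion is visible.

```python
def build_with_early_drain(pages, stir, drain_after_page, filter_unscanned):
    """pages: list of lists of TIDs, scanned in block order.
    stir: TIDs recorded by concurrent writers before the drain.
    Returns index entries as a list, so duplicates show up."""
    index = []
    for page_no, tids in enumerate(pages):
        index.extend(tids)  # the scan indexes what its snapshot sees
        if page_no == drain_after_page:
            for tid in stir:
                block = tid[0]
                if filter_unscanned and block > page_no:
                    continue  # the scan will reach this page later
                index.append(tid)
    return index


pages = [[(0, 1)], [(1, 1), (1, 2)]]
stir = [(1, 2)]  # inserted concurrently on a page not yet scanned

unfiltered = build_with_early_drain(pages, stir, 0, filter_unscanned=False)
filtered = build_with_early_drain(pages, stir, 0, filter_unscanned=True)
print(unfiltered.count((1, 2)), filtered.count((1, 2)))
```

Without the filter the TID `(1, 2)` is inserted once by the drain and once by the scan; the filter avoids that, at the cost of extra logic in the drain path.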

> - In step 7, we can take a new snapshot as often as we like. The
> snapshot is only used to evaluate expressions.

We shouldn't try to insert dead tuples into the index, so shouldn't we
also use some visibility checks in that step?


Kind regards,

Matthias van de Meent





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
@ 2025-11-27 16:56 Antonin Houska <[email protected]>
  2025-11-27 17:40 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  2025-11-27 18:22 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 2 replies; 64+ messages in thread

From: Antonin Houska @ 2025-11-27 16:56 UTC (permalink / raw)
  To: Michail Nikolaev <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Michail Nikolaev <[email protected]> wrote:

> I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> improvements) in some lighter way.

I haven't read the whole thread yet, but the effort to minimize the impact of
C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
idea is that C/RIC should avoid indexing dead tuples; however, a snapshot is not
necessary to distinguish a dead tuple from a live one. And w/o a snapshot, the
backend executing C/RIC does not restrict VACUUM on other tables.

Concurrent (re)build of unique index appears to be another topic of this
thread, but I think this approach should handle the problem too. The workflow
is:

1. Create an empty index.

2. Wait until all transactions are aware of the index, so they take the new
   index into account when deciding on new HOT chains. (This is already
   implemented.)

3. Set the 'indisready' flag so the index is ready for insertions.

4. While other transactions can insert their tuples into the index now,
   process the table one page at a time this way:

   4.1 Acquire (shared) content lock on the buffer.

   4.3 Collect the root tuples of HOT chains - these and only these need to be
       inserted into the index.

   4.4 Unlock the buffer.

5. Once the whole table is processed, insert the collected tuples into the
   index.

   To avoid insertions of tuples that concurrent transactions have just
   inserted, we'd need something like index.c:validate_index() (i.e. insert
   into the index only the tuples that it does not contain yet), but w/o
   snapshot because we already have the heap tuples collected.

   Also it'd make sense to wait for completion of all the transactions that
   currently have the table locked for INSERT/UPDATE: some of these might have
   inserted their tuples into the heap, but not yet into the index. If we
   included some of those tuples into our collection and insert them into the
   index first, the other transactions could end up with ERROR when inserting
   those tuples again.

6. Set the 'indisvalid' flag so that the index can be used by queries.

Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
page pruning triggered by queries (heap_page_prune_opt) should not be
disruptive. Actually C/RIC can do the pruning itself if it appears to be
useful. For example, if a whole HOT chain should be considered DEAD by the next
VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
do not insert the TID of the root tuple into the index unnecessarily.

I can even think of letting VACUUM run on the same table that C/RIC is
processing. In that case, interlocking would take place at page level: either
C/RIC or VACUUM can acquire lock for particular page, but not both. This would
be useful in cases C/RIC takes very long time.

In this case, C/RIC *must not* insert TIDs of dead tuples into the index at
all. Otherwise there could be race conditions such that VACUUM removes dead
tuples from the index and marks the corresponding heap items as UNUSED, but
C/RIC then re-inserts the index tuples.

To avoid this problem, C/RIC needs to re-check each TID before it inserts it
into the index and skip the insertion if the tuple (or the whole HOT-chain
starting at this tuple) it points to is DEAD according to the OldestXmin that
the most recent VACUUM used. (VACUUM could perhaps advertise its OldestXmin
for C/RIC via shared memory.)
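
The re-check described above could look roughly like this toy model. It is a simplified sketch in Python, not the real visibility code: `HeapTuple`, `dead_to_everyone` and `recheck_and_insert` are hypothetical names, xids are plain integers without wraparound, aborted inserters are not modelled, and a HOT chain is reduced to a single tuple.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class HeapTuple:
    tid: Tuple[int, int]
    xmax: Optional[int] = None  # xid of the deleter/updater, if any


def dead_to_everyone(tup, oldest_xmin, committed):
    # Simplified rule: the deleter committed and is older than OldestXmin,
    # so no snapshot can still see the tuple.
    return (
        tup.xmax is not None
        and tup.xmax in committed
        and tup.xmax < oldest_xmin
    )


def recheck_and_insert(collected, oldest_xmin, committed):
    index = set()
    for tup in collected:
        if dead_to_everyone(tup, oldest_xmin, committed):
            continue  # VACUUM may already have removed it; do not re-add
        index.add(tup.tid)
    return index


collected = [
    HeapTuple((0, 1)),            # never deleted
    HeapTuple((0, 2), xmax=90),   # deleted, deleter older than OldestXmin
    HeapTuple((0, 3), xmax=120),  # deleted recently, still visible to some
    HeapTuple((0, 4), xmax=95),   # deleter did not commit
]
print(sorted(recheck_and_insert(collected, oldest_xmin=100, committed={90, 120})))
```

Only `(0, 2)` is skipped here; a tuple whose deleter is newer than OldestXmin, or whose deleter did not commit, still has to be indexed.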

Also, before this re-checking starts, it must be ensured that VACUUM does not
start again, until the index creation is complete: a new run of VACUUM implies
a new value of OldestXmin, i.e. need for more stringent re-checking of the
heap tuples.

A related question is which OldestXmin to use in step 4.3. One option is to
use *exactly* the OldestXmin shared by VACUUM. However, that wouldn't work if
VACUUM starts while C/RIC is already in progress. (Which seems like a
significant restriction.)

Another option is to get the OldestXmin in the same way as VACUUM
does. However, the value can thus be different from the one used by VACUUM:
older if retrieved before VACUUM started and newer if retrieved while VACUUM
was already running. The first case can be handled by the heap tuple
re-checking (see above). The latter implies that, before setting 'indisvalid',
C/RIC has to wait until all snapshots have their xmin >= this (more recent)
OldestXmin. Otherwise some snapshots could miss data they should see.

(An implication of the rule that C/RIC must not insert TIDs of dead tuples
into the index is that VACUUM does not have to call the index AM bulk delete
while C/RIC is running for that index. This would be just an optimization.)

Of course, I could have missed some important point, so please explain why
this concept is broken :-) Or let me know if something needs to be explained
more in detail. Thanks.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
@ 2025-11-27 17:40 ` Mihail Nikalayeu <[email protected]>
  2025-11-27 18:57   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
  1 sibling, 1 reply; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-27 17:40 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hello, Antonin!

> I haven't read the whole thread yet, but the effort to minimize the impact of
> C/RIC on VACUUM seems to prevail
Yes, the thread is super long, and you probably missed the annotations to
the most important emails in [0].

> Of course, I could have missed some important point, so please explain why
> this concept is broken :-) Or let me know if something needs to be explained
> more in detail. Thanks.

Looks like your idea is not broken, but... it is actually almost
one-to-one with the idea used in the "full" version of the patch.
Explanations are available in [1] and [2].
In [3] I reduced the patch scope to find a solution compatible with REPACK.

Few comments:

> 1. Create an empty index.
Yes, patch does exactly the same, introducing special lightweight AM -
STIR (Short Term Index Replacement) to collect new tuples.

> 4.1 Acquire (shared) content lock on the buffer.
> 4.3 Collect the root tuples of HOT chains - these and only these need to be
>     inserted into the index.
> 4.4 Unlock the buffer.

Instead of that technique, essentially the same effect is achieved
differently: the scan still uses a snapshot, but it rotates it every few
pages for a fresh one.
This solves some of the issues with selecting live tuples that you
mentioned, without any additional logic.
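
The effect of rotating the snapshot can be sketched with a toy model. This is plain Python with hypothetical names (`held_xmin_per_page`, `rotate_every`); it assumes the xid counter advances by 10 per scanned page and that a fresh snapshot's xmin equals the next xid at the moment it is taken (that is, no other long-running transactions).

```python
def held_xmin_per_page(n_pages, rotate_every=None):
    """Return the xmin the scan holds back while reading each page."""
    next_xid = 100              # toy xid counter
    snapshot_xmin = next_xid    # snapshot taken when the scan starts
    held = []
    for page in range(n_pages):
        if rotate_every and page > 0 and page % rotate_every == 0:
            snapshot_xmin = next_xid  # drop the old snapshot, take a new one
        held.append(snapshot_xmin)
        next_xid += 10          # other transactions keep running
    return held


print(held_xmin_per_page(6))                  # single snapshot
print(held_xmin_per_page(6, rotate_every=2))  # rotated every 2 pages
```

With a single snapshot the held xmin stays at its initial value for the whole scan; with rotation it follows the xid counter, which is what lets VACUUM make progress elsewhere.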

> Concurrent (re)build of unique index appears to be another topic of this
> thread, but I think this approach should handle the problem too.
It is solved with a special commit in the original patchset.

You know, clever people think alike :)
Interesting fact: it is not the first time. In [4] Sergey also
proposed the idea of an "empty" index to collect tuples (which makes
a single scan sufficient).

So, it is very good news that the approach feels valid to multiple people
(also, Matthias initially introduced the idea of "fresh snapshot" ~ "no
snapshot").

One thing I am not happy about: it is not applicable to the REPACK case.

Best regards,
Mikhail.

[0]: https://commitfest.postgresql.org/patch/4971/
[1]: https://www.postgresql.org/message-id/[email protected]...
[2]: https://www.postgresql.org/message-id/[email protected]...
[3]: https://www.postgresql.org/message-id/[email protected]...
[4]: https://www.postgresql.org/message-id/flat/CAMAof6_FY0MrNJOuBrqvQqJKiwskFvjRtgpVHf-D7A%3DKvTtYXg%40m...

^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-11-27 17:40 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Mihail Nikalayeu <[email protected]>
@ 2025-11-27 18:57   ` Mihail Nikalayeu <[email protected]>
  0 siblings, 0 replies; 64+ messages in thread

From: Mihail Nikalayeu @ 2025-11-27 18:57 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: pgsql-hackers; Alvaro Herrera <[email protected]>

Hi, Antonin!

On Thu, Nov 27, 2025 at 6:40 PM Mihail Nikalayeu
<[email protected]> wrote:
> > 1. Create an empty index.
> Yes, patch does exactly the same, introducing special lightweight AM -
> STIR (Short Term Index Replacement) to collect new tuples.

I initially misunderstood: in your solution you propose to use
a single index.
In the patch, STIR is used to collect newly arriving tuples, while the
main index is built in a batched way.

> To avoid insertions of tuples that concurrent transactions have just
> inserted, we'd need something like index.c:validate_index() (i.e. insert
> into the index only the tuples that it does not contain yet), but w/o
> snapshot because we already have the heap tuples collected.

And later the main index and STIR are merged.

Best regards,
Mikhail.





^ permalink  raw  reply  [nested|flat] 64+ messages in thread

* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
@ 2025-11-27 18:22 ` Matthias van de Meent <[email protected]>
  2025-11-28 09:05   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  1 sibling, 1 reply; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-27 18:22 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Thu, 27 Nov 2025 at 17:56, Antonin Houska <[email protected]> wrote:
>
> Michail Nikolaev <[email protected]> wrote:
>
> > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> > improvements) in some lighter way.
>
> I haven't read the whole thread yet, but the effort to minimize the impact of
> C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
> idea is that C/RIC should avoid indexing dead tuples, however snapshot is not
> necessary to distinguish dead tuple from a live one. And w/o snapshot, the
> backend executing C/RIC does not restrict VACUUM on other tables.
>
> Concurrent (re)build of unique index appears to be another topic of this
> thread, but I think this approach should handle the problem too. The workflow
> is:
>
> 1. Create an empty index.
>
> 2. Wait until all transactions are aware of the index, so they take the new
>    index into account when deciding on new HOT chains. (This is already
>    implemented.)
>
> 3. Set the 'indisready' flag so the index is ready for insertions.
>
> 4. While other transactions can insert their tuples into the index now,
>    process the table one page at a time this way:
>
>    4.1 Acquire (shared) content lock on the buffer.
>
>    4.3 Collect the root tuples of HOT chains - these and only these need to be
>        inserted into the index.
>
>    4.4 Unlock the buffer.


> 5. Once the whole table is processed, insert the collected tuples into the
>    index.
>
>    To avoid insertions of tuples that concurrent transactions have just
>    inserted, we'd need something like index.c:validate_index() (i.e. insert
>    into the index only the tuples that it does not contain yet), but w/o
>    snapshot because we already have the heap tuples collected.
>
>    Also it'd make sense to wait for completion of all the transactions that
>    currently have the table locked for INSERT/UPDATE: some of these might have
>    inserted their tuples into the heap, but not yet into the index. If we
>    included some of those tuples into our collection and insert them into the
>    index first, the other transactions could end up with ERROR when inserting
>    those tuples again.
>
> 6. Set the 'indisvalid' flag so that the index can be used by queries.
>
> Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
> page pruning triggered by queries (heap_page_prune_opt) should not be
> disruptive. Actually C/RIC can do the pruning itself if it appears to be
> useful. For example, if a whole HOT chain should be considered DEAD by the next
> VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
> do not insert TID of the root tuple into the index unnecessarily.
[...]
> Of course, I could have missed some important point, so please explain why
> this concept is broken :-) Or let me know if something needs to be explained
> more in detail. Thanks.

1. When do you select and insert tuples that aren't part of a hot
chain into the index, i.e. tuples that were never updated after they
got inserted into the table? Or is every tuple "part of a hot chain"
even if the tuple wasn't ever updated?

2. HOT chains can be created while the index wasn't yet present, and
thus the indexed attributes of the root tuples can be different from
the most current tuple of a chain. If you only gather root tuples, we
could index incorrect data for that HOT chain. The correct approach
here is to index only the visible tuples, as those won't have been
updated in a non-HOT manner without all indexed attributes being
unchanged.

3. Having the index marked indisready before it contains any data is
going to slow down the indexing process significantly:
a. The main index build now must go through shared memory and buffer
locking, instead of being able to use backend-local memory
b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
significantly higher overhead than the bulk insertion logic in
ambuild(); in metrics of WAL, pages accessed (IO), and CPU cycles
spent.

So, I don't think moving away from ambuild() as the basis for initially
building the index is such a great idea.

(However, I do think that having an _option_ to build the index using
ambuildempty()+aminsert() instead of ambuild() might be useful, if
only to more easily compare "natural grown" indexes vs freshly built
ones, but that's completely orthogonal to CIC snapshotting
improvements.)
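
To make point 2 concrete, here is a minimal sketch in C. It is a toy model
and not PostgreSQL code: ToyTuple, value_from_root() and
value_from_visible() are hypothetical names, and the "visible" flag stands
in for a real visibility check. It only shows that the root of a HOT chain
and the visible member of the same chain can carry different values for a
column that was not indexed when the chain was created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one tuple version in a HOT chain (not the real heap format). */
typedef struct ToyTuple {
    int  indexed_value;  /* value of the column the new index will cover */
    bool visible;        /* stand-in for a real visibility check */
    int  next;           /* array position of the next chain member, or -1 */
} ToyTuple;

/* What a root-only scan would index: whatever the root tuple holds. */
int value_from_root(const ToyTuple *chain, int root)
{
    assert(chain != NULL && root >= 0);
    return chain[root].indexed_value;
}

/*
 * Walk the chain from the root and report the value of the first visible
 * version. Returns false if no member of the chain is visible.
 */
bool value_from_visible(const ToyTuple *chain, int root, int *out)
{
    assert(chain != NULL && out != NULL);
    for (int i = root; i != -1; i = chain[i].next) {
        if (chain[i].visible) {
            *out = chain[i].indexed_value;
            return true;
        }
    }
    return false;
}
```

With a two-member chain whose root holds 1 and whose visible successor
holds 2, the root-only approach would index 1 while queries expect 2.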


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-11-27 18:22 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
@ 2025-11-28 09:05   ` Antonin Houska <[email protected]>
  2025-11-28 09:51     ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  0 siblings, 1 reply; 64+ messages in thread

From: Antonin Houska @ 2025-11-28 09:05 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

Matthias van de Meent <[email protected]> wrote:

> On Thu, 27 Nov 2025 at 17:56, Antonin Houska <[email protected]> wrote:
> >
> > Michail Nikolaev <[email protected]> wrote:
> >
> > > I think about revisiting (1) ({CREATE INDEX, REINDEX} CONCURRENTLY
> > > improvements) in some lighter way.
> >
> > I haven't read the whole thread yet, but the effort to minimize the impact of
> > C/RIC on VACUUM seems to prevail. Following is one more proposal. The core
> > idea is that C/RIC should avoid indexing dead tuples; however, a snapshot is not
> > necessary to distinguish a dead tuple from a live one. And w/o a snapshot, the
> > backend executing C/RIC does not restrict VACUUM on other tables.
> >
> > Concurrent (re)build of unique index appears to be another topic of this
> > thread, but I think this approach should handle the problem too. The workflow
> > is:
> >
> > 1. Create an empty index.
> >
> > 2. Wait until all transactions are aware of the index, so they take the new
> >    index into account when deciding on new HOT chains. (This is already
> >    implemented.)
> >
> > 3. Set the 'indisready' flag so the index is ready for insertions.
> >
> > 4. While other transactions can insert their tuples into the index now,
> >    process the table one page at a time this way:
> >
> >    4.1 Acquire (shared) content lock on the buffer.
> >
> >    4.3 Collect the root tuples of HOT chains - these and only these need to be
> >        inserted into the index.
> >
> >    4.4 Unlock the buffer.
> 
> 
> > 5. Once the whole table is processed, insert the collected tuples into the
> >    index.
> >
> >    To avoid insertions of tuples that concurrent transactions have just
> >    inserted, we'd need something like index.c:validate_index() (i.e. insert
> >    into the index only the tuples that it does not contain yet), but w/o
> >    snapshot because we already have the heap tuples collected.
> >
> >    Also it'd make sense to wait for completion of all the transactions that
> >    currently have the table locked for INSERT/UPDATE: some of these might have
> >    inserted their tuples into the heap, but not yet into the index. If we
> >    included some of those tuples into our collection and insert them into the
> >    index first, the other transactions could end up with ERROR when inserting
> >    those tuples again.
> >
> > 6. Set the 'indisvalid' flag so that the index can be used by queries.
> >
> > Note on pruning: As we only deal with the root tuples of HOT chains (4.3),
> > page pruning triggered by queries (heap_page_prune_opt) should not be
> > disruptive. Actually C/RIC can do the pruning itself if it appears to be
> > useful. For example, if whole HOT chain should be considered DEAD by the next
> > VACUUM, pruning is likely (depending on the OldestXid) to remove it so that we
> > do not insert TID of the root tuple into the index unnecessarily.
> [...]
> > Of course, I could have missed some important point, so please explain why
> > this concept is broken :-) Or let me know if something needs to be explained
> > more in detail. Thanks.
> 
> 1. When do you select and insert tuples that aren't part of a HOT
> chain into the index, i.e. tuples that were never updated after they
> got inserted into the table? Or is every tuple "part of a HOT chain"
> even if the tuple was never updated?

Right, I considered a "standalone tuple" to be a HOT chain of length 1, so
it'll be picked too.

> 2. HOT chains can be created while the index wasn't yet present, and
> thus the indexed attributes of the root tuples can be different from
> the most current tuple of a chain. If you only gather root tuples, we
> could index incorrect data for that HOT chain. The correct approach
> here is to index only the visible tuples, as those won't have been
> updated in a non-HOT manner without all indexed attributes being
> unchanged.

Good point.

> 3. Having the index marked indisready before it contains any data is
> going to slow down the indexing process significantly:
> a. The main index build now must go through shared memory and buffer
> locking, instead of being able to use backend-local memory
> b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
> significantly higher overhead than the bulk insertion logic in
> ambuild(), in terms of WAL, pages accessed (IO), and CPU cycles
> spent.
>
> So, I don't think moving away from ambuild() as the basis for initially
> building the index is such a great idea.
> 
> (However, I do think that having an _option_ to build the index using
> ambuildempty()+aminsert() instead of ambuild() might be useful, if
> only to more easily compare "natural grown" indexes vs freshly built
> ones, but that's completely orthogonal to CIC snapshotting
> improvements.)

The retail insertions are not something this proposal depends on. I think it'd
be possible to build a separate index locally and then "merge" it with the
regular one. I just tried to propose a solution that does not need snapshots.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






* Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
  2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
  2025-11-27 18:22 ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Matthias van de Meent <[email protected]>
  2025-11-28 09:05   ` Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
@ 2025-11-28 09:51     ` Matthias van de Meent <[email protected]>
  0 siblings, 0 replies; 64+ messages in thread

From: Matthias van de Meent @ 2025-11-28 09:51 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Michail Nikolaev <[email protected]>; pgsql-hackers; Alvaro Herrera <[email protected]>

On Fri, 28 Nov 2025 at 10:05, Antonin Houska <[email protected]> wrote:
> Matthias van de Meent <[email protected]> wrote:
> > 3. Having the index marked indisready before it contains any data is
> > going to slow down the indexing process significantly:
> > a. The main index build now must go through shared memory and buffer
> > locking, instead of being able to use backend-local memory
> > b. The tuple-wise insertion path (IndexAmRoutine->aminsert) can have a
> > significantly higher overhead than the bulk insertion logic in
> > ambuild(), in terms of WAL, pages accessed (IO), and CPU cycles
> > spent.
> >
> > So, I don't think moving away from ambuild() as the basis for initially
> > building the index is such a great idea.
> >
> > (However, I do think that having an _option_ to build the index using
> > ambuildempty()+aminsert() instead of ambuild() might be useful, if
> > only to more easily compare "natural grown" indexes vs freshly built
> > ones, but that's completely orthogonal to CIC snapshotting
> > improvements.)
>
> The retail insertions are not something this proposal depends on. I think it'd
> be possible to build a separate index locally and then "merge" it with the
> regular one. I just tried to propose a solution that does not need snapshots.

I'm not sure we can generalize indexes to the point where merging two
built indexes is always both possible and efficient.

For example, the ANN indexes of pgvector (both HNSW and IVF) could
possibly have merge operations between indexes of the same type and
schema, but it would require a lot of effort on the side of the AM to
support merging; there is no trivial merge operation that also retains
the quality of the index without going through the aminsert() path.
Conversely, the current approach to CIC doesn't require additional
work on the index AM's side, and that's a huge enabler for every kind
of index.
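
For contrast, a minimal sketch of the kind of merge that is cheap when
index entries are totally ordered. This is an illustration only, not any
existing index AM API: ToyEntry, entry_cmp() and merge_sorted_runs() are
hypothetical names, and real indexes do not store entries as flat arrays.
Two runs sorted by (key, tid) are merged in one pass, and an entry present
in both runs is emitted once. Graph-based or cluster-based structures such
as the ANN indexes mentioned above have no such total order to lean on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical index entry: a key plus a heap tuple identifier. */
typedef struct ToyEntry {
    int      key;
    unsigned tid;
} ToyEntry;

/* Order entries by key first, then by tid. */
static int entry_cmp(const ToyEntry *a, const ToyEntry *b)
{
    if (a->key != b->key)
        return a->key < b->key ? -1 : 1;
    if (a->tid != b->tid)
        return a->tid < b->tid ? -1 : 1;
    return 0;
}

/*
 * Merge two runs sorted by (key, tid) into out, which must have room for
 * na + nb entries. An entry that appears in both runs is written once.
 * Returns the number of entries written.
 */
size_t merge_sorted_runs(const ToyEntry *a, size_t na,
                         const ToyEntry *b, size_t nb,
                         ToyEntry *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        int c = entry_cmp(&a[i], &b[j]);

        if (c < 0)
            out[n++] = a[i++];
        else if (c > 0)
            out[n++] = b[j++];
        else {
            /* Same (key, tid) in both runs: keep one copy. */
            out[n++] = a[i++];
            j++;
        }
    }
    while (i < na)
        out[n++] = a[i++];
    while (j < nb)
        out[n++] = b[j++];
    return n;
}
```

The single pass works only because both inputs share one ordering; that
is the property a generic merge operation could not assume for every AM.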


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)






end of thread, other threads:[~2026-04-13 01:05 UTC | newest]

Thread overview: 64+ messages
2025-03-07 22:58 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Michail Nikolaev <[email protected]>
2025-05-18 15:09 ` Mihail Nikalayeu <[email protected]>
2025-05-18 15:56   ` Álvaro Herrera <[email protected]>
2025-05-18 16:09     ` Mihail Nikalayeu <[email protected]>
2025-05-23 21:59       ` Mihail Nikalayeu <[email protected]>
2025-06-16 16:17         ` Sergey Sargsyan <[email protected]>
2025-06-16 20:00           ` Mihail Nikalayeu <[email protected]>
2025-06-16 20:21             ` Sergey Sargsyan <[email protected]>
2025-06-17 15:55               ` Sergey Sargsyan <[email protected]>
2025-06-18 10:49                 ` Mihail Nikalayeu <[email protected]>
2025-06-18 16:33                   ` Sergey Sargsyan <[email protected]>
2025-06-18 21:15                     ` Mihail Nikalayeu <[email protected]>
2025-06-18 21:36                       ` Sergey Sargsyan <[email protected]>
2025-06-21 20:32                         ` Mihail Nikalayeu <[email protected]>
2025-07-03 00:23                           ` Mihail Nikalayeu <[email protected]>
2025-07-07 12:00                             ` Sergey Sargsyan <[email protected]>
2025-07-10 14:30                               ` Mihail Nikalayeu <[email protected]>
2025-09-05 00:25                                 ` Mihail Nikalayeu <[email protected]>
2025-09-28 09:26                                   ` Mihail Nikalayeu <[email protected]>
2025-10-28 18:37                                     ` Mihail Nikalayeu <[email protected]>
2025-11-09 18:02                                       ` Mihail Nikalayeu <[email protected]>
2025-11-22 17:08                                         ` Mihail Nikalayeu <[email protected]>
2025-11-27 18:40                                         ` Matthias van de Meent <[email protected]>
2025-11-27 18:59                                           ` Mihail Nikalayeu <[email protected]>
2025-11-27 20:07                                             ` Matthias van de Meent <[email protected]>
2025-11-28 14:50                                               ` Mihail Nikalayeu <[email protected]>
2025-11-28 16:57                                                 ` Matthias van de Meent <[email protected]>
2025-11-28 17:58                                                   ` Hannu Krosing <[email protected]>
2025-11-28 18:05                                                     ` Hannu Krosing <[email protected]>
2025-11-28 18:40                                                       ` Mihail Nikalayeu <[email protected]>
2025-12-01 10:29                                                       ` Antonin Houska <[email protected]>
2025-12-01 10:49                                                         ` Mihail Nikalayeu <[email protected]>
2025-12-02 07:28                                                           ` Antonin Houska <[email protected]>
2025-12-02 10:27                                                             ` Mihail Nikalayeu <[email protected]>
2025-12-02 11:12                                                               ` Matthias van de Meent <[email protected]>
2026-03-09 00:09                                                                 ` Mihail Nikalayeu <[email protected]>
2026-03-23 22:08                                                                   ` Mihail Nikalayeu <[email protected]>
2026-03-28 19:17                                                                     ` Mihail Nikalayeu <[email protected]>
2026-03-31 22:11                                                                       ` Mihail Nikalayeu <[email protected]>
2026-04-06 18:21                                                                         ` Mihail Nikalayeu <[email protected]>
2026-04-07 01:42                                                                           ` Josh Kupershmidt <[email protected]>
2026-04-07 23:19                                                                             ` Mihail Nikalayeu <[email protected]>
2026-04-11 16:56                                                                               ` Mihail Nikalayeu <[email protected]>
2026-04-13 01:05                                                                               ` Josh Kupershmidt <[email protected]>
2025-11-28 18:31                                                     ` Matthias van de Meent <[email protected]>
2025-11-28 19:08                                                       ` Hannu Krosing <[email protected]>
2025-11-28 20:41                                                         ` Mihail Nikalayeu <[email protected]>
2025-11-28 21:01                                                           ` Hannu Krosing <[email protected]>
2025-12-02 12:02                                                         ` Matthias van de Meent <[email protected]>
2025-12-01 09:09                                                   ` Antonin Houska <[email protected]>
2025-12-02 10:51                                                     ` Matthias van de Meent <[email protected]>
2025-12-04 08:34                                                       ` Antonin Houska <[email protected]>
2025-12-04 16:32                                                         ` Matthias van de Meent <[email protected]>
2025-12-04 19:15                                                           ` Antonin Houska <[email protected]>
2025-12-04 21:03                                                             ` Hannu Krosing <[email protected]>
2025-12-16 13:43                                                               ` Heikki Linnakangas <[email protected]>
2025-12-16 21:58                                                                 ` Mihail Nikalayeu <[email protected]>
2025-12-17 18:22                                                                 ` Matthias van de Meent <[email protected]>
2025-11-27 16:56 Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements Antonin Houska <[email protected]>
2025-11-27 17:40 ` Mihail Nikalayeu <[email protected]>
2025-11-27 18:57   ` Mihail Nikalayeu <[email protected]>
2025-11-27 18:22 ` Matthias van de Meent <[email protected]>
2025-11-28 09:05   ` Antonin Houska <[email protected]>
2025-11-28 09:51     ` Matthias van de Meent <[email protected]>
