From: Michail Nikolaev <[email protected]>
To: Matthias van de Meent <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Andrey Borodin <[email protected]>
Cc: Melanie Plageman <[email protected]>
Subject: Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
Date: Tue, 24 Dec 2024 20:39:23 +0100
Message-ID: <CANtu0ogfGHUQpHgkGXAv0wamD=kW_NcEep8ZAp9SKvFRNz0FLQ@mail.gmail.com> (raw)
In-Reply-To: <CANtu0oiD-AvXdygYqYP-WkFq=7vSL78Wj8UU-PUX+3huPNqroQ@mail.gmail.com>
References: <CANtu0oiLc-+7h9zfzOVy2cv2UuYk_5MUReVLnVbOay6OgD_KGg@mail.gmail.com>
	<CAEze2WgW6pj48xJhG_YLUE1QS+n9Yv0AZQwaWeb-r+X=HAxU_g@mail.gmail.com>
	<CANtu0oizNtPUrPB0Mh+2vyjdijTX=LZvO5_dZN3+NqvE-CFPtw@mail.gmail.com>
	<CAEze2Wi3BFLkFBcZ+Brfbr-mGBCcWXcWuHucnCnw5ZOQotc6Eg@mail.gmail.com>
	<CANtu0ojRX=osoiXL9JJG6g6qOowXVbVYX+mDsN+2jmFVe=eG7w@mail.gmail.com>
	<CAEze2Wg03Ps_StwEhgCdSn7VXY9ZUM=zCrf-m1dRZpTWv6wD_A@mail.gmail.com>
	<CANtu0oj66JjAq8xyRSeO=MuRHYS2XsYbhHRRESHtOcLJs=3+Sw@mail.gmail.com>
	<CANtu0ogT2Qn7-q_jK6+DqBQvFoTt69eQJDKxJARXV9pdWjd0Gg@mail.gmail.com>
	<CANtu0ogXgNkEuxbDRwznAZpxEXRmj3NzOen3y-RGHDwig0YBRw@mail.gmail.com>
	<CANtu0oi+FTMqDb+6Bv8w7VHiTFVMB1uAAip_P841WQH+ktPixw@mail.gmail.com>
	<CAEze2WgeyVnDb_j4gJQYC4+HcSsYQAdeRA1-F0KDnJ=Y0A_TzA@mail.gmail.com>
	<CANtu0oga9zqqEFhdmcWyJTK4d6EGMJsMB_LMgVSE8ar0xVm7Ew@mail.gmail.com>
	<CANtu0oirtBK_g4jxtw3jehSop3b0WSQaek5Sv5OGSXwxgcHwZQ@mail.gmail.com>
	<CANtu0oijWPRGRpaRR_OvT2R5YALzscvcOTFh-=uZKUpNJmuZtw@mail.gmail.com>
	<CAEze2WgHFnYdxkNUmvqxOc-cFUNEYaTqL7+Pei=CtA-ZrTOFyw@mail.gmail.com>
	<CANtu0oipL3e8fLnejbH4HnByMW6G_auR4v+ns8j-UHhuPW=9og@mail.gmail.com>
	<CANtu0ojmVw8GW5bJknnqSp7Dp1xEuoBewdu2imtQ2tGnWpiWEg@mail.gmail.com>
	<CAEze2WgNHTWfw_bP6O0zW_=vi1D-yi1nh6-JDj9kd=8UaB-zLA@mail.gmail.com>
	<CANtu0ojA5=rT8BN5==OAiQJZh8CAxD_U8thFhZ3mwrZQ6roNOA@mail.gmail.com>
	<CAEze2Wh3eSAnXFdY_6roNPb3WD-YsKbNLiKf=cPmAGHkPUd22w@mail.gmail.com>
	<CANtu0og_=ypCbH2ZFayn44i=CL0HAXKW390LfZhQ1F56HoFXtQ@mail.gmail.com>
	<CAEze2WghpUS29bJJh5GCZ+WtpO4qWmxiFF-CTWFiP4Qq62G58w@mail.gmail.com>
	<CANtu0oiuuGRvYRsH-y0iQjfc+JpT9o4mPUXVkz97+sW9BXA+FA@mail.gmail.com>
	<CANtu0og6P+O10XLm-AsoqOhZYEjr8SEHFETadSJ8ifO01YP1qg@mail.gmail.com>
	<CANtu0oj0pakvxXhHJhsiKgk=ywY57m623G=OvhJnLVFWe9JCpg@mail.gmail.com>
	<CANtu0oiOj65kZqP8ngnsc9O+gywUJATgOjOP6pUARXWsmS9cBQ@mail.gmail.com>
	<CANtu0oiT9SPFhs=h8DR4YVPox7TJ6jkfR9JgqT-0L+=uy=Lxng@mail.gmail.com>
	<CANtu0ogBOtd9ravu1CUbuZWgq6qvn1rny38PGKDPk9zzQPH8_A@mail.gmail.com>
	<CAEze2Wj9SgwOpe_1CWnS_D-txQaQyXArR=dm4DTnha93=yua4g@mail.gmail.com>
	<CANtu0ohFr7OzNSbxqBhUpR0mXDYyt0Xt6+=Tbq0EC7as7kr+Lg@mail.gmail.com>
	<CANtu0oh4PwBn_h+4p_MxFigRAyJvF-0nA9Tm5NFRwfsWWjZQiA@mail.gmail.com>
	<CANtu0ojHEVU9U_bxgViRmtqNTJ92LnF+76-yzn4axYjGsK2kqQ@mail.gmail.com>
	<CANtu0ogS871NkdUnZW9P_LVpLzhSJ1+cETK0b55cYjs=v2qbPA@mail.gmail.com>
	<CANtu0ohRVBDf4x7Ge3oVzgf4NzMb_DhmTM1ae0u1WUA+CD0UqA@mail.gmail.com>
	<CANtu0ogTfyng-H4yWr3Pm_+PXX+XvDx1AM1sXTy1V7DM6jJ+Bw@mail.gmail.com>
	<CANtu0oi+nbipJUsMZcoUfodCyuTN_DAXD22UstjMTYWG=tJ4jw@mail.gmail.com>
	<CANtu0oiuUF7L0wTGxOHfumyoVge3n7C4rAjdmFo=efeEwobXbg@mail.gmail.com>
	<CANtu0oiD-AvXdygYqYP-WkFq=7vSL78Wj8UU-PUX+3huPNqroQ@mail.gmail.com>

Hello!

Rebased + snapshot resetting during validation + removed PROC_IN_SAFE_IC.
Going to do some benchmarks soon.

Best regards,
Mikhail.

Attachments:

  [application/octet-stream] v9-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch (35.1K, 3-v9-0005-Allow-snapshot-resets-in-concurrent-unique-index-.patch)
From 86d498d18c232a62c4da4e5849258c1ab09f69b3 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 7 Dec 2024 23:27:34 +0100
Subject: [PATCH v9 5/9] Allow snapshot resets in concurrent unique index
 builds

Previously, concurrent unique index builds used a fixed snapshot for the entire
scan to ensure proper uniqueness checks. This could delay vacuum's ability to
clean up dead tuples.

Now reset snapshots periodically during concurrent unique index builds, while
still maintaining uniqueness by:

1. Ignoring dead tuples during uniqueness checks in tuplesort
2. Adding a uniqueness check in _bt_load that detects multiple alive tuples with the same key values

This improves vacuum effectiveness during long-running index builds without
compromising index uniqueness enforcement.
---
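Note for reviewers (not part of the commit message): the second check listed
above can be pictured with a minimal standalone sketch. It assumes a single
integer key and replaces the SnapshotSelf heap fetch with a precomputed
"alive" flag. SketchTuple and find_alive_duplicate are hypothetical names used
only for illustration, and unlike _bt_load the sketch inspects the liveness of
every tuple instead of fetching it lazily once an equal key is seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an index tuple coming out of the sort. */
typedef struct SketchTuple
{
	int			key;		/* single integer key column (assumption) */
	bool		alive;		/* stands in for a SnapshotSelf heap fetch */
} SketchTuple;

/*
 * Scan a key-sorted run and return the position of the second alive tuple
 * found within one equal-key group, or -1 if every group contains at most
 * one alive tuple.  Dead tuples never cause a failure by themselves.
 */
int
find_alive_duplicate(const SketchTuple *tuples, size_t ntuples)
{
	bool		seen_alive = false;

	for (size_t i = 0; i < ntuples; i++)
	{
		/* the input must already be sorted by key */
		if (i > 0)
			assert(tuples[i - 1].key <= tuples[i].key);

		/* a new key group starts: forget what we saw in the previous one */
		if (i == 0 || tuples[i].key != tuples[i - 1].key)
			seen_alive = false;

		if (tuples[i].alive)
		{
			if (seen_alive)
				return (int) i;	/* two alive tuples share this key */
			seen_alive = true;
		}
	}

	return -1;
}
```

For the run "addda" from the _bt_load comment (one key; alive, dead, dead,
dead, alive) the function returns 4, the position of the second alive tuple.
A run with at most one alive tuple per key returns -1.
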
 src/backend/access/heap/README.HOT            |  12 +-
 src/backend/access/heap/heapam_handler.c      |   6 +-
 src/backend/access/nbtree/nbtdedup.c          |   8 +-
 src/backend/access/nbtree/nbtsort.c           | 173 ++++++++++++++----
 src/backend/access/nbtree/nbtsplitloc.c       |  12 +-
 src/backend/access/nbtree/nbtutils.c          |  29 ++-
 src/backend/catalog/index.c                   |   8 +-
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/utils/sort/tuplesortvariants.c    |  67 +++++--
 src/include/access/nbtree.h                   |   4 +-
 src/include/access/tableam.h                  |   5 +-
 src/include/utils/tuplesort.h                 |   1 +
 .../expected/cic_reset_snapshots.out          |   6 +
 13 files changed, 251 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 74e407f375a..829dad1194e 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -386,12 +386,12 @@ have the HOT-safety property enforced before we start to build the new
 index.
 
 After waiting for transactions which had the table open, we build the index
-for all rows that are valid in a fresh snapshot.  Any tuples visible in the
-snapshot will have only valid forward-growing HOT chains.  (They might have
-older HOT updates behind them which are broken, but this is OK for the same
-reason it's OK in a regular index build.)  As above, we point the index
-entry at the root of the HOT-update chain but we use the key value from the
-live tuple.
+for all rows that are valid in a fresh snapshot, which is updated every so
+often. Any tuples visible in the snapshot will have only valid forward-growing
+HOT chains.  (They might have older HOT updates behind them which are broken,
+but this is OK for the same reason it's OK in a regular index build.)
+As above, we point the index entry at the root of the HOT-update chain but we
+use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
 we again wait for transactions which have the table open.  Then we take
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 8144743c338..0f706553605 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1232,15 +1232,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
 	 * and index whatever's live according to that while that snapshot is reset
-	 * every so often (in case of non-unique index).
+	 * every so often.
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
-	 * For unique index we need consistent snapshot for the whole scan.
+	 * For concurrent builds of non-system indexes, we may want to periodically
+	 * reset snapshots to allow vacuum to clean up tuples.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
-					  !indexInfo->ii_Unique &&
 					  !is_system_catalog; /* just for the case */
 
 	/* okay to ignore lazy VACUUMs here */
diff --git a/src/backend/access/nbtree/nbtdedup.c b/src/backend/access/nbtree/nbtdedup.c
index 456d86b51c9..31b59265a29 100644
--- a/src/backend/access/nbtree/nbtdedup.c
+++ b/src/backend/access/nbtree/nbtdedup.c
@@ -148,7 +148,7 @@ _bt_dedup_pass(Relation rel, Buffer buf, IndexTuple newitem, Size newitemsz,
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
 		else if (state->deduplicate &&
-				 _bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+				 _bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/*
@@ -374,7 +374,7 @@ _bt_bottomupdel_pass(Relation rel, Buffer buf, Relation heapRel,
 			/* itup starts first pending interval */
 			_bt_dedup_start_pending(state, itup, offnum);
 		}
-		else if (_bt_keep_natts_fast(rel, state->base, itup) > nkeyatts &&
+		else if (_bt_keep_natts_fast(rel, state->base, itup, NULL) > nkeyatts &&
 				 _bt_dedup_save_htid(state, itup))
 		{
 			/* Tuple is equal; just added its TIDs to pending interval */
@@ -789,12 +789,12 @@ _bt_do_singleval(Relation rel, Page page, BTDedupState state,
 	itemid = PageGetItemId(page, minoff);
 	itup = (IndexTuple) PageGetItem(page, itemid);
 
-	if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+	if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 	{
 		itemid = PageGetItemId(page, PageGetMaxOffsetNumber(page));
 		itup = (IndexTuple) PageGetItem(page, itemid);
 
-		if (_bt_keep_natts_fast(rel, newitem, itup) > nkeyatts)
+		if (_bt_keep_natts_fast(rel, newitem, itup, NULL) > nkeyatts)
 			return true;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 783489600fc..38355601421 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -83,6 +83,7 @@ typedef struct BTSpool
 	Relation	index;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 } BTSpool;
 
 /*
@@ -101,6 +102,7 @@ typedef struct BTShared
 	Oid			indexrelid;
 	bool		isunique;
 	bool		nulls_not_distinct;
+	bool		unique_dead_ignored;
 	bool		isconcurrent;
 	int			scantuplesortstates;
 
@@ -203,15 +205,13 @@ typedef struct BTLeader
  */
 typedef struct BTBuildState
 {
-	bool		isunique;
-	bool		nulls_not_distinct;
 	bool		havedead;
 	Relation	heap;
 	BTSpool    *spool;
 
 	/*
-	 * spool2 is needed only when the index is a unique index. Dead tuples are
-	 * put into spool2 instead of spool in order to avoid uniqueness check.
+	 * spool2 is needed only when the index is a unique index and is built non-concurrently.
+	 * Dead tuples are put into spool2 instead of spool in order to avoid uniqueness check.
 	 */
 	BTSpool    *spool2;
 	double		indtuples;
@@ -303,8 +303,6 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		ResetUsage();
 #endif							/* BTREE_BUILD_STATS */
 
-	buildstate.isunique = indexInfo->ii_Unique;
-	buildstate.nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
 	buildstate.havedead = false;
 	buildstate.heap = heap;
 	buildstate.spool = NULL;
@@ -379,6 +377,11 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	btspool->index = index;
 	btspool->isunique = indexInfo->ii_Unique;
 	btspool->nulls_not_distinct = indexInfo->ii_NullsNotDistinct;
+	/*
+	 * We need to ignore dead tuples for unique checks in case of concurrent build.
+	 * It is required because of the periodic reset of the snapshot.
+	 */
+	btspool->unique_dead_ignored = indexInfo->ii_Concurrent && indexInfo->ii_Unique;
 
 	/* Save as primary spool */
 	buildstate->spool = btspool;
@@ -427,8 +430,9 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * the use of parallelism or any other factor.
 	 */
 	buildstate->spool->sortstate =
-		tuplesort_begin_index_btree(heap, index, buildstate->isunique,
-									buildstate->nulls_not_distinct,
+		tuplesort_begin_index_btree(heap, index, btspool->isunique,
+									btspool->nulls_not_distinct,
+									btspool->unique_dead_ignored,
 									maintenance_work_mem, coordinate,
 									TUPLESORT_NONE);
 
@@ -436,8 +440,12 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.  We expect that the second spool (for
 	 * dead tuples) won't get very full, so we give it only work_mem.
+	 *
+	 * In a concurrent build, dead tuples do not need to be put into the index,
+	 * since we wait for all snapshots older than the reference snapshot during
+	 * the validation phase.
 	 */
-	if (indexInfo->ii_Unique)
+	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
 	{
 		BTSpool    *btspool2 = (BTSpool *) palloc0(sizeof(BTSpool));
 		SortCoordinate coordinate2 = NULL;
@@ -468,7 +476,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 		 * full, so we give it only work_mem
 		 */
 		buildstate->spool2->sortstate =
-			tuplesort_begin_index_btree(heap, index, false, false, work_mem,
+			tuplesort_begin_index_btree(heap, index, false, false, false, work_mem,
 										coordinate2, TUPLESORT_NONE);
 	}
 
@@ -1147,13 +1155,116 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
 	bool		deduplicate;
+	bool		fail_on_alive_duplicate;
 
 	wstate->bulkstate = smgr_bulk_start_rel(wstate->index, MAIN_FORKNUM);
 
 	deduplicate = wstate->inskey->allequalimage && !btspool->isunique &&
 		BTGetDeduplicateItems(wstate->index);
+	/*
+	 * unique_dead_ignored does not guarantee that the spool is free of
+	 * multiple alive tuples with the same key values. This may happen if
+	 * alive tuples are separated by a few dead tuples, like this: addda.
+	 */
+	fail_on_alive_duplicate = btspool->unique_dead_ignored;
 
-	if (merge)
+	if (fail_on_alive_duplicate)
+	{
+		bool	seen_alive = false,
+				prev_tested = false;
+		IndexTuple prev = NULL;
+		TupleTableSlot 		*slot = MakeSingleTupleTableSlot(RelationGetDescr(wstate->heap),
+															   &TTSOpsBufferHeapTuple);
+		IndexFetchTableData *fetch = table_index_fetch_begin(wstate->heap);
+
+		Assert(btspool->isunique);
+		Assert(!btspool2);
+
+		while ((itup = tuplesort_getindextuple(btspool->sortstate, true)) != NULL)
+		{
+			bool	tuples_equal = false;
+
+			/* When we see first tuple, create first index page */
+			if (state == NULL)
+				state = _bt_pagestate(wstate, 0);
+
+			if (prev != NULL) /* not the first tuple */
+			{
+				bool	has_nulls = false,
+						call_again, /* just to pass something */
+						ignored,  /* just to pass something */
+						now_alive;
+				ItemPointerData tid;
+
+				/* is this tuple equal to the previous one? */
+				if (wstate->inskey->allequalimage)
+					tuples_equal = _bt_keep_natts_fast(wstate->index, prev, itup, &has_nulls) > keysz;
+				else
+					tuples_equal = _bt_keep_natts(wstate->index, prev, itup, wstate->inskey, &has_nulls) > keysz;
+
+				/* handle null values correctly */
+				if (has_nulls && !btspool->nulls_not_distinct)
+					tuples_equal = false;
+
+				if (tuples_equal)
+				{
+					/* check the previous tuple, if not yet checked */
+					if (!prev_tested)
+					{
+						call_again = false;
+						tid = prev->t_tid;
+						seen_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+						prev_tested = true;
+					}
+
+					call_again = false;
+					tid = itup->t_tid;
+					now_alive = table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+					/* are multiple alive tuples detected in equal group? */
+					if (seen_alive && now_alive)
+					{
+						char *key_desc;
+						TupleDesc tupDes = RelationGetDescr(wstate->index);
+						bool isnull[INDEX_MAX_KEYS];
+						Datum values[INDEX_MAX_KEYS];
+
+						index_deform_tuple(itup, tupDes, values, isnull);
+
+						key_desc = BuildIndexValueDescription(wstate->index, values, isnull);
+
+						/* keep this message in sync with the same in comparetup_index_btree_tiebreak */
+						ereport(ERROR,
+								(errcode(ERRCODE_UNIQUE_VIOLATION),
+										errmsg("could not create unique index \"%s\"",
+											   RelationGetRelationName(wstate->index)),
+										key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+										errdetail("Duplicate keys exist."),
+										errtableconstraint(wstate->heap,
+														   RelationGetRelationName(wstate->index))));
+					}
+					seen_alive |= now_alive;
+				}
+			}
+
+			if (!tuples_equal)
+			{
+				seen_alive = false;
+				prev_tested = false;
+			}
+
+			_bt_buildadd(wstate, state, itup, 0);
+			if (prev) pfree(prev);
+			prev = CopyIndexTuple(itup);
+
+			/* Report progress */
+			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+										 ++tuples_done);
+		}
+		ExecDropSingleTupleTableSlot(slot);
+		table_index_fetch_end(fetch);
+	}
+	else if (merge)
 	{
 		/*
 		 * Another BTSpool for dead tuples exists. Now we have to merge
@@ -1314,7 +1425,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 										InvalidOffsetNumber);
 			}
 			else if (_bt_keep_natts_fast(wstate->index, dstate->base,
-										 itup) > keysz &&
+										 itup, NULL) > keysz &&
 					 _bt_dedup_save_htid(dstate, itup))
 			{
 				/*
@@ -1411,7 +1522,6 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
-	bool		reset_snapshot;
 	bool		wait_for_snapshot_attach;
 	int			querylen;
 
@@ -1430,21 +1540,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
-    /*
-	 * For concurrent non-unique index builds, we can periodically reset snapshots
-	 * to allow the xmin horizon to advance. This is safe since these builds don't
-	 * require a consistent view across the entire scan. Unique indexes still need
-	 * a stable snapshot to properly enforce uniqueness constraints.
-     */
-	reset_snapshot = isconcurrent && !btspool->isunique;
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that, while that snapshot may be reset periodically in
-	 * case of non-unique index.
+	 * live according to that, while that snapshot may be reset periodically.
 	 */
 	if (!isconcurrent)
 	{
@@ -1452,16 +1553,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
-	else if (reset_snapshot)
+	else
 	{
+		/*
+		 * For concurrent index builds, we can periodically reset snapshots to allow
+		 * the xmin horizon to advance. This is safe since these builds don't
+		 * require a consistent view across the entire scan.
+		 */
 		snapshot = InvalidSnapshot;
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
-	else
-	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1531,6 +1632,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->indexrelid = RelationGetRelid(btspool->index);
 	btshared->isunique = btspool->isunique;
 	btshared->nulls_not_distinct = btspool->nulls_not_distinct;
+	btshared->unique_dead_ignored = btspool->unique_dead_ignored;
 	btshared->isconcurrent = isconcurrent;
 	btshared->scantuplesortstates = scantuplesortstates;
 	btshared->queryid = pgstat_get_my_query_id();
@@ -1545,7 +1647,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
 								  snapshot,
-								  reset_snapshot);
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1626,7 +1728,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * In case when leader going to reset own active snapshot as well - we need to
 	 * wait until all workers imported initial snapshot.
 	 */
-	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
 
 	if (wait_for_snapshot_attach)
 		WaitForParallelWorkersToAttach(pcxt, true);
@@ -1742,6 +1844,7 @@ _bt_leader_participate_as_worker(BTBuildState *buildstate)
 	leaderworker->index = buildstate->spool->index;
 	leaderworker->isunique = buildstate->spool->isunique;
 	leaderworker->nulls_not_distinct = buildstate->spool->nulls_not_distinct;
+	leaderworker->unique_dead_ignored = buildstate->spool->unique_dead_ignored;
 
 	/* Initialize second spool, if required */
 	if (!btleader->btshared->isunique)
@@ -1845,11 +1948,12 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	btspool->index = indexRel;
 	btspool->isunique = btshared->isunique;
 	btspool->nulls_not_distinct = btshared->nulls_not_distinct;
+	btspool->unique_dead_ignored = btshared->unique_dead_ignored;
 
 	/* Look up shared state private to tuplesort.c */
 	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
 	tuplesort_attach_shared(sharedsort, seg);
-	if (!btshared->isunique)
+	if (!btshared->isunique || btshared->isconcurrent)
 	{
 		btspool2 = NULL;
 		sharedsort2 = NULL;
@@ -1928,6 +2032,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 													 btspool->index,
 													 btspool->isunique,
 													 btspool->nulls_not_distinct,
+													 btspool->unique_dead_ignored,
 													 sortmem, coordinate,
 													 TUPLESORT_NONE);
 
@@ -1950,14 +2055,12 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2,
 		coordinate2->nParticipants = -1;
 		coordinate2->sharedsort = sharedsort2;
 		btspool2->sortstate =
-			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false,
+			tuplesort_begin_index_btree(btspool->heap, btspool->index, false, false, false,
 										Min(sortmem, work_mem), coordinate2,
 										false);
 	}
 
 	/* Fill in buildstate for _bt_build_callback() */
-	buildstate.isunique = btshared->isunique;
-	buildstate.nulls_not_distinct = btshared->nulls_not_distinct;
 	buildstate.havedead = false;
 	buildstate.heap = btspool->heap;
 	buildstate.spool = btspool;
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1f40d40263e..e2ed4537026 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -687,7 +687,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	{
 		itemid = PageGetItemId(state->origpage, maxoff);
 		tup = (IndexTuple) PageGetItem(state->origpage, itemid);
-		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+		keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 		if (keepnatts > 1 && keepnatts <= nkeyatts)
 		{
@@ -718,7 +718,7 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
-	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
+	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem, NULL);
 
 	if (keepnatts > 1 && keepnatts <= nkeyatts)
 	{
@@ -967,7 +967,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * avoid appending a heap TID in new high key, we're done.  Finish split
 	 * with default strategy and initial split interval.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 		return perfectpenalty;
 
@@ -988,7 +988,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 	 * If page is entirely full of duplicates, a single value strategy split
 	 * will be performed.
 	 */
-	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost);
+	perfectpenalty = _bt_keep_natts_fast(state->rel, leftmost, rightmost, NULL);
 	if (perfectpenalty <= indnkeyatts)
 	{
 		*strategy = SPLIT_MANY_DUPLICATES;
@@ -1027,7 +1027,7 @@ _bt_strategy(FindSplitData *state, SplitPoint *leftpage,
 		itemid = PageGetItemId(state->origpage, P_HIKEY);
 		hikey = (IndexTuple) PageGetItem(state->origpage, itemid);
 		perfectpenalty = _bt_keep_natts_fast(state->rel, hikey,
-											 state->newitem);
+											 state->newitem, NULL);
 		if (perfectpenalty <= indnkeyatts)
 			*strategy = SPLIT_SINGLE_VALUE;
 		else
@@ -1149,7 +1149,7 @@ _bt_split_penalty(FindSplitData *state, SplitPoint *split)
 	lastleft = _bt_split_lastleft(state, split);
 	firstright = _bt_split_firstright(state, split);
 
-	return _bt_keep_natts_fast(state->rel, lastleft, firstright);
+	return _bt_keep_natts_fast(state->rel, lastleft, firstright, NULL);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index a531d37908a..e729b4a4d7c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -100,8 +100,6 @@ static bool _bt_check_rowcompare(ScanKey skey,
 								 ScanDirection dir, bool *continuescan);
 static void _bt_checkkeys_look_ahead(IndexScanDesc scan, BTReadPageState *pstate,
 									 int tupnatts, TupleDesc tupdesc);
-static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
-						   IndexTuple firstright, BTScanInsert itup_key);
 
 
 /*
@@ -4676,7 +4674,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	Assert(!BTreeTupleIsPivot(lastleft) && !BTreeTupleIsPivot(firstright));
 
 	/* Determine how many attributes must be kept in truncated tuple */
-	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key);
+	keepnatts = _bt_keep_natts(rel, lastleft, firstright, itup_key, NULL);
 
 #ifdef DEBUG_NO_TRUNCATE
 	/* Force truncation to be ineffective for testing purposes */
@@ -4794,17 +4792,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 /*
  * _bt_keep_natts - how many key attributes to keep when truncating.
  *
+ * This is exported to be used as comparison function during concurrent
+ * unique index build in case _bt_keep_natts_fast is not suitable because
+ * collation is not "allequalimage"/deduplication-safe.
+ *
  * Caller provides two tuples that enclose a split point.  Caller's insertion
  * scankey is used to compare the tuples; the scankey's argument values are
  * not considered here.
  *
+ * If hasnulls is given, it is set to true when any compared attribute is NULL.
+ *
  * This can return a number of attributes that is one greater than the
  * number of key attributes for the index relation.  This indicates that the
  * caller must use a heap TID as a unique-ifier in new pivot tuple.
  */
-static int
+int
 _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
-			   BTScanInsert itup_key)
+			   BTScanInsert itup_key,
+			   bool *hasnulls)
 {
 	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	TupleDesc	itupdesc = RelationGetDescr(rel);
@@ -4830,6 +4835,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 
 		if (isNull1 != isNull2)
 			break;
@@ -4849,7 +4856,7 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * expected in an allequalimage index.
 	 */
 	Assert(!itup_key->allequalimage ||
-		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright));
+		   keepnatts == _bt_keep_natts_fast(rel, lastleft, firstright, NULL));
 
 	return keepnatts;
 }
@@ -4860,7 +4867,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * This is exported so that a candidate split point can have its effect on
  * suffix truncation inexpensively evaluated ahead of time when finding a
  * split location.  A naive bitwise approach to datum comparisons is used to
- * save cycles.
+ * save cycles. Also, it may be used as comparison function during concurrent
+ * build of unique index.
  *
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
@@ -4869,6 +4877,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * "equal image" columns, routine is guaranteed to give the same result as
  * _bt_keep_natts would.
  *
+ * If hasnulls is given, it is set to true when any compared attribute is NULL.
+ *
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts, even when the index uses
  * an opclass or collation that is not "allequalimage"/deduplication-safe.
@@ -4877,7 +4887,8 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * more balanced split point.
  */
 int
-_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
+_bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+					bool *hasnulls)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	int			keysz = IndexRelationGetNumberOfKeyAttributes(rel);
@@ -4894,6 +4905,8 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 
 		datum1 = index_getattr(lastleft, attnum, itupdesc, &isNull1);
 		datum2 = index_getattr(firstright, attnum, itupdesc, &isNull2);
+		if (hasnulls)
+			(*hasnulls) |= (isNull1 || isNull2);
 		att = TupleDescCompactAttr(itupdesc, attnum - 1);
 
 		if (isNull1 != isNull2)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index fcb6e940ff2..73454accf61 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3293,9 +3293,9 @@ IndexCheckExclusion(Relation heapRelation,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
- * Furthermore, in case of non-unique index we set SO_RESET_SNAPSHOT for the
- * scan, which causes new snapshot to be set as active every so often. The reason
- * for that is to propagate the xmin horizon forward.
+ * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
+ * snapshot to be set as active every so often. The reason for that is to
+ * propagate the xmin horizon forward.
  *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 6c1fce8ed25..a02729911fe 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,8 +1670,8 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using single or
-	 * multiple refreshing snapshots. We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible using multiple
+	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index e07ba4ea4b1..aa4fcaac9a0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -123,6 +123,7 @@ typedef struct
 
 	bool		enforceUnique;	/* complain if we find duplicate tuples */
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
+	bool		uniqueDeadIgnored; /* ignore dead tuples in unique check */
 } TuplesortIndexBTreeArg;
 
 /*
@@ -349,6 +350,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
 							bool uniqueNullsNotDistinct,
+							bool uniqueDeadIgnored,
 							int workMem,
 							SortCoordinate coordinate,
 							int sortopt)
@@ -391,6 +393,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	arg->index.indexRel = indexRel;
 	arg->enforceUnique = enforceUnique;
 	arg->uniqueNullsNotDistinct = uniqueNullsNotDistinct;
+	arg->uniqueDeadIgnored = uniqueDeadIgnored;
 
 	indexScanKey = _bt_mkscankey(indexRel, NULL);
 
@@ -1520,6 +1523,7 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		Datum		values[INDEX_MAX_KEYS];
 		bool		isnull[INDEX_MAX_KEYS];
 		char	   *key_desc;
+		bool		uniqueCheckFail = true;
 
 		/*
 		 * Some rather brain-dead implementations of qsort (such as the one in
@@ -1529,18 +1533,57 @@ comparetup_index_btree_tiebreak(const SortTuple *a, const SortTuple *b,
 		 */
 		Assert(tuple1 != tuple2);
 
-		index_deform_tuple(tuple1, tupDes, values, isnull);
-
-		key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
-
-		ereport(ERROR,
-				(errcode(ERRCODE_UNIQUE_VIOLATION),
-				 errmsg("could not create unique index \"%s\"",
-						RelationGetRelationName(arg->index.indexRel)),
-				 key_desc ? errdetail("Key %s is duplicated.", key_desc) :
-				 errdetail("Duplicate keys exist."),
-				 errtableconstraint(arg->index.heapRel,
-									RelationGetRelationName(arg->index.indexRel))));
+		/* This is a fail-fast check, see _bt_load for details. */
+		if (arg->uniqueDeadIgnored)
+		{
+			bool	any_tuple_dead,
+					call_again = false,
+					ignored;
+
+			TupleTableSlot	*slot = MakeSingleTupleTableSlot(RelationGetDescr(arg->index.heapRel),
+															 &TTSOpsBufferHeapTuple);
+			ItemPointerData tid = tuple1->t_tid;
+
+			IndexFetchTableData *fetch = table_index_fetch_begin(arg->index.heapRel);
+			any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again, &ignored);
+
+			if (!any_tuple_dead)
+			{
+				call_again = false;
+				tid = tuple2->t_tid;
+				any_tuple_dead = !table_index_fetch_tuple(fetch, &tid, SnapshotSelf, slot, &call_again,
+														  &ignored);
+			}
+
+			if (any_tuple_dead)
+			{
+				elog(DEBUG5, "skipping duplicate values because some of them are dead: (%u,%u) vs (%u,%u)",
+					 ItemPointerGetBlockNumber(&tuple1->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple1->t_tid),
+					 ItemPointerGetBlockNumber(&tuple2->t_tid),
+					 ItemPointerGetOffsetNumber(&tuple2->t_tid));
+
+				uniqueCheckFail = false;
+			}
+			ExecDropSingleTupleTableSlot(slot);
+			table_index_fetch_end(fetch);
+		}
+		if (uniqueCheckFail)
+		{
+			index_deform_tuple(tuple1, tupDes, values, isnull);
+
+			key_desc = BuildIndexValueDescription(arg->index.indexRel, values, isnull);
+
+			/* keep this error message in sync with the same in _bt_load */
+			ereport(ERROR,
+					(errcode(ERRCODE_UNIQUE_VIOLATION),
+							errmsg("could not create unique index \"%s\"",
+								   RelationGetRelationName(arg->index.indexRel)),
+							key_desc ? errdetail("Key %s is duplicated.", key_desc) :
+							errdetail("Duplicate keys exist."),
+							errtableconstraint(arg->index.heapRel,
+											   RelationGetRelationName(arg->index.indexRel))));
+		}
 	}
 
 	/*
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 123fba624db..4200d2bd20e 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1297,8 +1297,10 @@ extern bool btproperty(Oid index_oid, int attno,
 extern char *btbuildphasename(int64 phasenum);
 extern IndexTuple _bt_truncate(Relation rel, IndexTuple lastleft,
 							   IndexTuple firstright, BTScanInsert itup_key);
+extern int	_bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
+						   BTScanInsert itup_key, bool *hasnulls);
 extern int	_bt_keep_natts_fast(Relation rel, IndexTuple lastleft,
-								IndexTuple firstright);
+								IndexTuple firstright, bool *hasnulls);
 extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 66e1ad83f1a..0ecc3147bbd 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1799,9 +1799,8 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique concurrent  index build SO_RESET_SNAPSHOT is applied
- * for the scan. That leads for changing snapshots on the fly to allow xmin
- * horizon propagate.
+ * In case of a concurrent index build, SO_RESET_SNAPSHOT is applied for the
+ * scan. This changes snapshots on the fly so the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..ae5f4d28fdc 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -428,6 +428,7 @@ extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 												   Relation indexRel,
 												   bool enforceUnique,
 												   bool uniqueNullsNotDistinct,
+												   bool uniqueDeadIgnored,
 												   int workMem, SortCoordinate coordinate,
 												   int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 595a4000ce0..9f03fa3033c 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -41,7 +41,11 @@ END; $$;
 ----------------
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -86,7 +90,9 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 (1 row)
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
-- 
2.43.0
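
For readers skimming the tuplesort hunk in the patch above: with
uniqueDeadIgnored set, a pair of equal keys is reported as a violation only
when both heap tuples are still visible. Below is a minimal standalone sketch
of that decision. It is not the patch code: DemoTid, demo_is_live_fn,
demo_unique_check_fails and demo_odd_offset_is_live are hypothetical names,
and the callback merely stands in for the
table_index_fetch_tuple(..., SnapshotSelf, ...) probe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct DemoTid
{
    unsigned    block;
    unsigned    offset;
} DemoTid;

/* Hypothetical visibility probe; the patch uses table_index_fetch_tuple(). */
typedef bool (*demo_is_live_fn) (const DemoTid *tid, void *arg);

/*
 * Decide whether two index tuples with equal keys are a real uniqueness
 * violation.  Returns true when the build should raise an error.
 */
static bool
demo_unique_check_fails(const DemoTid *a, const DemoTid *b,
                        bool dead_ignored,
                        demo_is_live_fn is_live, void *arg)
{
    DemoTid     probe;

    if (!dead_ignored)
        return true;            /* without the flag, any duplicate is an error */

    /* Probe a copy so the caller's TIDs are never modified. */
    probe = *a;
    if (!is_live(&probe, arg))
        return false;           /* first tuple is dead: skip the duplicate */

    probe = *b;
    if (!is_live(&probe, arg))
        return false;           /* second tuple is dead: skip the duplicate */

    return true;                /* both tuples are live: genuine duplicate */
}

/* Toy probe for demonstration only: a tuple is live when its offset is odd. */
static bool
demo_odd_offset_is_live(const DemoTid *tid, void *arg)
{
    (void) arg;
    return (tid->offset % 2) == 1;
}
```

Probing a copy of each TID, rather than the tuple's own field, is a
deliberate choice in this sketch: a probe that follows an update chain may
rewrite the TID it is given, and the sort input should stay untouched.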



  [application/octet-stream] v9-0002-Add-stress-tests-for-concurrent-index-operations.patch (6.5K, 4-v9-0002-Add-stress-tests-for-concurrent-index-operations.patch)
From 23c3c9f06ca446f1b2840c18e511a11c827cbc14 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 16:24:20 +0100
Subject: [PATCH v9 2/9] Add stress tests for concurrent index operations

Add comprehensive stress tests for concurrent index operations, focusing on:
* Testing CREATE/REINDEX/DROP INDEX CONCURRENTLY under heavy write load
* Verifying index integrity during concurrent HOT updates
* Testing various index types including unique and partial indexes
* Validating index correctness using amcheck
* Exercising parallel worker configurations

These stress tests help ensure reliability of concurrent index operations
under heavy load conditions.
---
 src/bin/pg_amcheck/meson.build  |   1 +
 src/bin/pg_amcheck/t/006_cic.pl | 144 ++++++++++++++++++++++++++++++++
 2 files changed, 145 insertions(+)
 create mode 100644 src/bin/pg_amcheck/t/006_cic.pl

diff --git a/src/bin/pg_amcheck/meson.build b/src/bin/pg_amcheck/meson.build
index 292b33eb094..4a8f4fbc8b0 100644
--- a/src/bin/pg_amcheck/meson.build
+++ b/src/bin/pg_amcheck/meson.build
@@ -28,6 +28,7 @@ tests += {
       't/003_check.pl',
       't/004_verify_heapam.pl',
       't/005_opclass_damage.pl',
+      't/006_cic.pl',
     ],
   },
 }
diff --git a/src/bin/pg_amcheck/t/006_cic.pl b/src/bin/pg_amcheck/t/006_cic.pl
new file mode 100644
index 00000000000..142e8fb845e
--- /dev/null
+++ b/src/bin/pg_amcheck/t/006_cic.pl
@@ -0,0 +1,144 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+# Test REINDEX CONCURRENTLY with concurrent modifications and HOT updates
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+Test::More->builder->todo_start('filesystem bug')
+  if PostgreSQL::Test::Utils::has_wal_read_bug;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('RC_test');
+$node->init;
+$node->append_conf('postgresql.conf',
+	'lock_timeout = ' . (1000 * $PostgreSQL::Test::Utils::timeout_default));
+$node->append_conf('postgresql.conf', 'fsync = off');
+$node->start;
+$node->safe_psql('postgres', q(CREATE EXTENSION amcheck));
+$node->safe_psql('postgres', q(CREATE TABLE tbl(i int primary key,
+								c1 money default 0, c2 money default 0,
+								c3 money default 0, updated_at timestamp)));
+$node->safe_psql('postgres', q(CREATE INDEX CONCURRENTLY idx ON tbl(i, updated_at);));
+# create sequence
+$node->safe_psql('postgres', q(CREATE UNLOGGED SEQUENCE in_row_rebuild START 1 INCREMENT 1;));
+$node->safe_psql('postgres', q(SELECT nextval('in_row_rebuild');));
+
+# Create helper functions for predicate tests
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_stable() RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		EXECUTE 'SELECT txid_current()';
+		RETURN true;
+	END; $$;
+));
+
+$node->safe_psql('postgres', q(
+	CREATE FUNCTION predicate_const(integer) RETURNS bool IMMUTABLE
+	LANGUAGE plpgsql AS $$
+	BEGIN
+		RETURN MOD($1, 2) = 0;
+	END; $$;
+));
+
+# Run CIC/RIC with different options concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set variant random(0, 5)
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					\if :variant = 0
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at);
+					\elif :variant = 1
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_stable();
+					\elif :variant = 2
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE MOD(i, 2) = 0;
+					\elif :variant = 3
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, updated_at) WHERE predicate_const(i);
+					\elif :variant = 4
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(predicate_const(i));
+					\elif :variant = 5
+						CREATE INDEX CONCURRENTLY idx_2 ON tbl(i, predicate_const(i), updated_at) WHERE predicate_const(i);
+					\endif
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1000, 100000)
+				BEGIN;
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+				COMMIT;
+			\endif
+		)
+	});
+
+$node->safe_psql('postgres', q(TRUNCATE TABLE tbl;));
+
+# Run CIC/RIC for unique index concurrently with upserts
+$node->pgbench(
+	'--no-vacuum --client=30 --jobs=4 --exit-on-abort --transactions=2500',
+	0,
+	[qr{actually processed}],
+	[qr{^$}],
+	'concurrent operations with unique REINDEX/CREATE INDEX CONCURRENTLY',
+	{
+		'concurrent_ops_unique_idx' => q(
+			SELECT pg_try_advisory_lock(42)::integer AS gotlock \gset
+			\if :gotlock
+				SELECT nextval('in_row_rebuild') AS last_value \gset
+				\set parallels random(0, 4)
+				\if :last_value < 3
+					ALTER TABLE tbl SET (parallel_workers=:parallels);
+					CREATE UNIQUE INDEX CONCURRENTLY idx_2 ON tbl(i);
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					REINDEX INDEX CONCURRENTLY idx_2;
+					\sleep 10 ms
+					SELECT bt_index_check('idx_2', heapallindexed => true, checkunique => true);
+					DROP INDEX CONCURRENTLY idx_2;
+				\endif
+				SELECT pg_advisory_unlock(42);
+			\else
+				\set num random(1, power(10, random(1, 5)))
+				INSERT INTO tbl VALUES(floor(random()*:num),0,0,0,now())
+					ON CONFLICT(i) DO UPDATE SET updated_at = now();
+				SELECT setval('in_row_rebuild', 1);
+			\endif
+		)
+	});
+
+$node->stop;
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v9-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch (30.1K, 5-v9-0004-Allow-snapshot-resets-during-parallel-concurrent-.patch)
From 43662a22363ddab775ec4373711be0cf39bcc1be Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Mon, 2 Dec 2024 01:33:21 +0100
Subject: [PATCH v9 4/9] Allow snapshot resets during parallel concurrent index
 builds

Previously, non-unique concurrent index builds in parallel mode required a
consistent MVCC snapshot throughout the build, which could hold back the xmin
horizon and prevent dead tuple cleanup. This patch extends the previous work
on snapshot resets (introduced for non-parallel builds) to also support
parallel builds.

Key changes:
- Add infrastructure to track snapshot restoration in parallel workers
- Extend parallel scan initialization to support periodic snapshot resets
- Wait for parallel workers to restore their initial snapshots before
  proceeding with scan
- Add regression tests to verify behavior with various index types

The snapshot reset approach is safe for non-unique indexes since they don't
need snapshot consistency across the entire scan. For unique indexes, we
continue to maintain a consistent snapshot to properly enforce uniqueness
constraints.

This helps reduce the xmin horizon impact of long-running concurrent index
builds in parallel mode, improving VACUUM's ability to clean up dead tuples.
---
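
A minimal standalone sketch of the handshake described in the commit message
above, assuming a single process and plain bool flags. The real patch keeps
one flag per worker in dynamic shared memory. Every name here
(DemoParallelState and the demo_* functions) is hypothetical, and how the
leader actually polls or sleeps is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical leader-side view of the per-worker flag array. */
typedef struct DemoParallelState
{
    int         nworkers;
    bool       *snapshot_restored;  /* one flag per worker */
} DemoParallelState;

/* Allocate the flags, all initially false.  Returns false on failure. */
static bool
demo_state_init(DemoParallelState *st, int nworkers)
{
    if (nworkers <= 0)
        return false;
    st->nworkers = nworkers;
    st->snapshot_restored = calloc((size_t) nworkers, sizeof(bool));
    return st->snapshot_restored != NULL;
}

/* Clear all flags, e.g. before workers are launched again. */
static void
demo_state_reset(DemoParallelState *st)
{
    for (int i = 0; i < st->nworkers; i++)
        st->snapshot_restored[i] = false;
}

/* Called on behalf of worker i once it has restored its initial snapshot. */
static void
demo_worker_restored(DemoParallelState *st, int i)
{
    st->snapshot_restored[i] = true;
}

/* Leader-side check: have the first nlaunched workers all reported in? */
static bool
demo_all_restored(const DemoParallelState *st, int nlaunched)
{
    for (int i = 0; i < nlaunched; i++)
    {
        if (!st->snapshot_restored[i])
            return false;
    }
    return true;
}

/*
 * Mirrors the condition used in the btree hunk of this patch: snapshots are
 * reset only for concurrent, non-unique builds, and the leader has to wait
 * for the workers only if it takes part in the scan itself.
 */
static bool
demo_must_wait_for_snapshots(bool isconcurrent, bool isunique,
                             bool leaderparticipates)
{
    bool        reset_snapshot = isconcurrent && !isunique;

    return reset_snapshot && leaderparticipates;
}

static void
demo_state_free(DemoParallelState *st)
{
    free(st->snapshot_restored);
    st->snapshot_restored = NULL;
    st->nworkers = 0;
}
```

In a real multi-process setting the flags would need to live in shared memory
and be read with appropriate synchronization; the sketch only illustrates the
bookkeeping.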
 src/backend/access/brin/brin.c                | 43 +++++++++-------
 src/backend/access/heap/heapam_handler.c      | 12 +++--
 src/backend/access/nbtree/nbtsort.c           | 38 ++++++++++++--
 src/backend/access/table/tableam.c            | 37 ++++++++++++--
 src/backend/access/transam/parallel.c         | 50 +++++++++++++++++--
 src/backend/catalog/index.c                   |  2 +-
 src/backend/executor/nodeSeqscan.c            |  3 +-
 src/backend/utils/time/snapmgr.c              |  8 ---
 src/include/access/parallel.h                 |  3 +-
 src/include/access/relscan.h                  |  1 +
 src/include/access/tableam.h                  |  9 ++--
 .../expected/cic_reset_snapshots.out          | 23 ++++++++-
 .../sql/cic_reset_snapshots.sql               |  7 ++-
 13 files changed, 179 insertions(+), 57 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d80394766d5..f076cedcc2c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -143,7 +143,6 @@ typedef struct BrinLeader
 	 */
 	BrinShared *brinshared;
 	Sharedsort *sharedsort;
-	Snapshot	snapshot;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 } BrinLeader;
@@ -231,7 +230,7 @@ static void brin_fill_empty_ranges(BrinBuildState *state,
 static void _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 								 bool isconcurrent, int request);
 static void _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state);
-static Size _brin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static Size _brin_parallel_estimate_shared(Relation heap);
 static double _brin_parallel_heapscan(BrinBuildState *state);
 static double _brin_parallel_merge(BrinBuildState *state);
 static void _brin_leader_participate_as_worker(BrinBuildState *buildstate,
@@ -2357,7 +2356,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 {
 	ParallelContext *pcxt;
 	int			scantuplesortstates;
-	Snapshot	snapshot;
 	Size		estbrinshared;
 	Size		estsort;
 	BrinShared *brinshared;
@@ -2367,6 +2365,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2388,25 +2387,25 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
-	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * concurrent build, we take a regular MVCC snapshot and push it as active.
+	 * Later we index whatever's live according to that snapshot while that
+	 * snapshot is reset periodically.
 	 */
 	if (!isconcurrent)
 	{
 		Assert(ActiveSnapshotSet());
-		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
 	else
 	{
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		Assert(!ActiveSnapshotSet());
 		PushActiveSnapshot(GetTransactionSnapshot());
 	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
 	 */
-	estbrinshared = _brin_parallel_estimate_shared(heap, snapshot);
+	estbrinshared = _brin_parallel_estimate_shared(heap);
 	shm_toc_estimate_chunk(&pcxt->estimator, estbrinshared);
 	estsort = tuplesort_estimate_shared(scantuplesortstates);
 	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
@@ -2446,8 +2445,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
-			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
 		return;
@@ -2472,7 +2469,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromBrinShared(brinshared),
-								  snapshot);
+								  isconcurrent ? InvalidSnapshot : SnapshotAny,
+								  isconcurrent);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -2518,7 +2516,6 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 		brinleader->nparticipanttuplesorts++;
 	brinleader->brinshared = brinshared;
 	brinleader->sharedsort = sharedsort;
-	brinleader->snapshot = snapshot;
 	brinleader->walusage = walusage;
 	brinleader->bufferusage = bufferusage;
 
@@ -2534,6 +2531,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->bs_leader = brinleader;
 
+	/*
+	 * In a concurrent build, snapshots are reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we must first wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = isconcurrent && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_brin_leader_participate_as_worker(buildstate, heap, index);
@@ -2542,7 +2549,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -2565,9 +2573,6 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
 	for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
 		InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
 
-	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(brinleader->snapshot))
-		UnregisterSnapshot(brinleader->snapshot);
 	DestroyParallelContext(brinleader->pcxt);
 	ExitParallelMode();
 }
@@ -2767,14 +2772,14 @@ _brin_parallel_merge(BrinBuildState *state)
 
 /*
  * Returns size of shared memory required to store state for a parallel
- * brin index build based on the snapshot its parallel scan will use.
+ * brin index build.
  */
 static Size
-_brin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+_brin_parallel_estimate_shared(Relation heap)
 {
 	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
 	return add_size(BUFFERALIGN(sizeof(BrinShared)),
-					table_parallelscan_estimate(heap, snapshot));
+					table_parallelscan_estimate(heap, InvalidSnapshot));
 }
 
 /*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index d9fce07e8ad..8144743c338 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1231,14 +1231,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples). In a
 	 * concurrent build, or during bootstrap, we take a regular MVCC snapshot
-	 * and index whatever's live according to that.
+	 * and index whatever's live according to that while that snapshot is reset
+	 * every so often (in case of non-unique index).
 	 */
 	OldestXmin = InvalidTransactionId;
 
 	/*
 	 * For unique index we need consistent snapshot for the whole scan.
-	 * In case of parallel scan some additional infrastructure required
-	 * to perform scan with SO_RESET_SNAPSHOT which is not yet ready.
 	 */
 	reset_snapshots = indexInfo->ii_Concurrent &&
 					  !indexInfo->ii_Unique &&
@@ -1300,8 +1299,11 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
-		PushActiveSnapshot(snapshot);
-		need_pop_active_snapshot = true;
+		if (!reset_snapshots)
+		{
+			PushActiveSnapshot(snapshot);
+			need_pop_active_snapshot = true;
+		}
 	}
 
 	hscan = (HeapScanDesc) scan;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8647422ed05..783489600fc 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1411,6 +1411,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
 	bool		need_pop_active_snapshot = true;
+	bool		reset_snapshot;
+	bool		wait_for_snapshot_attach;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1428,12 +1430,21 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 
 	scantuplesortstates = leaderparticipates ? request + 1 : request;
 
+	/*
+	 * For concurrent non-unique index builds, we can periodically reset snapshots
+	 * to allow the xmin horizon to advance. This is safe since these builds don't
+	 * require a consistent view across the entire scan. Unique indexes still need
+	 * a stable snapshot to properly enforce uniqueness constraints.
+	 */
+	reset_snapshot = isconcurrent && !btspool->isunique;
+
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
 	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
 	 * concurrent build, we take a regular MVCC snapshot and index whatever's
-	 * live according to that.
+	 * live according to that, while that snapshot may be reset periodically in
+	 * case of non-unique index.
 	 */
 	if (!isconcurrent)
 	{
@@ -1441,6 +1452,11 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 		snapshot = SnapshotAny;
 		need_pop_active_snapshot = false;
 	}
+	else if (reset_snapshot)
+	{
+		snapshot = InvalidSnapshot;
+		PushActiveSnapshot(GetTransactionSnapshot());
+	}
 	else
 	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
@@ -1501,7 +1517,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	{
 		if (need_pop_active_snapshot)
 			PopActiveSnapshot();
-		if (IsMVCCSnapshot(snapshot))
+		if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
 		ExitParallelMode();
@@ -1528,7 +1544,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	btshared->brokenhotchain = false;
 	table_parallelscan_initialize(btspool->heap,
 								  ParallelTableScanFromBTShared(btshared),
-								  snapshot);
+								  snapshot,
+								  reset_snapshot);
 
 	/*
 	 * Store shared tuplesort-private state, for which we reserved space.
@@ -1604,6 +1621,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->btleader = btleader;
 
+	/*
+	 * In a concurrent build, snapshots may be reset periodically.  If the leader
+	 * is going to reset its own active snapshot as well, we must first wait
+	 * until all workers have imported the initial snapshot.
+	 */
+	wait_for_snapshot_attach = reset_snapshot && leaderparticipates;
+
+	if (wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, true);
+
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
 		_bt_leader_participate_as_worker(buildstate);
@@ -1612,7 +1639,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * Caller needs to wait for all launched workers when we return.  Make
 	 * sure that the failure-to-start case will not hang forever.
 	 */
-	WaitForParallelWorkersToAttach(pcxt);
+	if (!wait_for_snapshot_attach)
+		WaitForParallelWorkersToAttach(pcxt, false);
 	if (need_pop_active_snapshot)
 		PopActiveSnapshot();
 }
@@ -1636,7 +1664,7 @@ _bt_end_parallel(BTLeader *btleader)
 		InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
 
 	/* Free last reference to MVCC snapshot, if one was used */
-	if (IsMVCCSnapshot(btleader->snapshot))
+	if (btleader->snapshot != InvalidSnapshot && IsMVCCSnapshot(btleader->snapshot))
 		UnregisterSnapshot(btleader->snapshot);
 	DestroyParallelContext(btleader->pcxt);
 	ExitParallelMode();
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index bd8715b6797..cac7a9ea88a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -131,10 +131,10 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 {
 	Size		sz = 0;
 
-	if (IsMVCCSnapshot(snapshot))
+	if (snapshot != InvalidSnapshot && IsMVCCSnapshot(snapshot))
 		sz = add_size(sz, EstimateSnapshotSpace(snapshot));
 	else
-		Assert(snapshot == SnapshotAny);
+		Assert(snapshot == SnapshotAny || snapshot == InvalidSnapshot);
 
 	sz = add_size(sz, rel->rd_tableam->parallelscan_estimate(rel));
 
@@ -143,21 +143,36 @@ table_parallelscan_estimate(Relation rel, Snapshot snapshot)
 
 void
 table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan,
-							  Snapshot snapshot)
+							  Snapshot snapshot, bool reset_snapshot)
 {
 	Size		snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);
 
 	pscan->phs_snapshot_off = snapshot_off;
 
-	if (IsMVCCSnapshot(snapshot))
+	/*
+	 * Initialize parallel scan description. For normal scans with a regular
+	 * MVCC snapshot, serialize the snapshot info. For scans that use periodic
+	 * snapshot resets, mark the scan accordingly.
+	 */
+	if (reset_snapshot)
+	{
+		Assert(snapshot == InvalidSnapshot);
+		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = true;
+		INJECTION_POINT("table_parallelscan_initialize");
+	}
+	else if (IsMVCCSnapshot(snapshot))
 	{
 		SerializeSnapshot(snapshot, (char *) pscan + pscan->phs_snapshot_off);
 		pscan->phs_snapshot_any = false;
+		pscan->phs_reset_snapshot = false;
 	}
 	else
 	{
 		Assert(snapshot == SnapshotAny);
+		Assert(!reset_snapshot);
 		pscan->phs_snapshot_any = true;
+		pscan->phs_reset_snapshot = false;
 	}
 }
 
@@ -170,7 +185,19 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan)
 
 	Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator));
 
-	if (!pscan->phs_snapshot_any)
+	/*
+	 * For scans that use periodic snapshot resets, mark the scan accordingly
+	 * and use the active snapshot as the initial state.  Otherwise, restore
+	 * the serialized snapshot, or use SnapshotAny if none was serialized.
+	 */
+	if (pscan->phs_reset_snapshot)
+	{
+		Assert(ActiveSnapshotSet());
+		flags |= SO_RESET_SNAPSHOT;
+		/* Start with current active snapshot. */
+		snapshot = GetActiveSnapshot();
+	}
+	else if (!pscan->phs_snapshot_any)
 	{
 		/* Snapshot was serialized -- restore it */
 		snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off);
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 0a1e089ec1d..d49c6ee410f 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -76,6 +76,7 @@
 #define PARALLEL_KEY_RELMAPPER_STATE		UINT64CONST(0xFFFFFFFFFFFF000D)
 #define PARALLEL_KEY_UNCOMMITTEDENUMS		UINT64CONST(0xFFFFFFFFFFFF000E)
 #define PARALLEL_KEY_CLIENTCONNINFO			UINT64CONST(0xFFFFFFFFFFFF000F)
+#define PARALLEL_KEY_SNAPSHOT_RESTORED		UINT64CONST(0xFFFFFFFFFFFF0010)
 
 /* Fixed-size parallel state. */
 typedef struct FixedParallelState
@@ -301,6 +302,10 @@ InitializeParallelDSM(ParallelContext *pcxt)
 										pcxt->nworkers));
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+		shm_toc_estimate_chunk(&pcxt->estimator, mul_size(sizeof(bool),
+							   pcxt->nworkers));
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+
 		/* Estimate how much we'll need for the entrypoint info. */
 		shm_toc_estimate_chunk(&pcxt->estimator, strlen(pcxt->library_name) +
 							   strlen(pcxt->function_name) + 2);
@@ -372,6 +377,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		char	   *entrypointstate;
 		char	   *uncommittedenumsspace;
 		char	   *clientconninfospace;
+		bool	   *snapshot_set_flag_space;
 		Size		lnamelen;
 
 		/* Serialize shared libraries we have loaded. */
@@ -487,6 +493,19 @@ InitializeParallelDSM(ParallelContext *pcxt)
 		strcpy(entrypointstate, pcxt->library_name);
 		strcpy(entrypointstate + lnamelen + 1, pcxt->function_name);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_ENTRYPOINT, entrypointstate);
+
+		/*
+		 * Allocate one flag per worker in dynamic shared memory; each worker
+		 * sets its flag once it has restored its snapshot.
+		 */
+		snapshot_set_flag_space =
+				shm_toc_allocate(pcxt->toc, mul_size(sizeof(bool), pcxt->nworkers));
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			pcxt->worker[i].snapshot_restored = &snapshot_set_flag_space[i];
+			*pcxt->worker[i].snapshot_restored = false;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, snapshot_set_flag_space);
 	}
 
 	/* Update nworkers_to_launch, in case we changed nworkers above. */
@@ -542,6 +561,17 @@ ReinitializeParallelDSM(ParallelContext *pcxt)
 			pcxt->worker[i].error_mqh = shm_mq_attach(mq, pcxt->seg, NULL);
 		}
 	}
+
+	/* Set snapshot restored flag to false. */
+	if (pcxt->nworkers > 0)
+	{
+		bool	   *snapshot_restored_space;
+		int			i;
+		snapshot_restored_space =
+				shm_toc_lookup(pcxt->toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+		for (i = 0; i < pcxt->nworkers; ++i)
+			snapshot_restored_space[i] = false;
+	}
 }
 
 /*
@@ -657,6 +687,10 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * Wait for all workers to attach to their error queues, and throw an error if
  * any worker fails to do this.
  *
+ * If wait_for_snapshot is true, a worker is treated as attached only after it
+ * has restored its snapshot.  This is needed with periodic snapshot resets, to
+ * ensure all workers have a valid initial snapshot before the scan proceeds.
+ *
  * Callers can assume that if this function returns successfully, then the
  * number of workers given by pcxt->nworkers_launched have initialized and
  * attached to their error queues.  Whether or not these workers are guaranteed
@@ -686,7 +720,7 @@ LaunchParallelWorkers(ParallelContext *pcxt)
  * call this function at all.
  */
 void
-WaitForParallelWorkersToAttach(ParallelContext *pcxt)
+WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot)
 {
 	int			i;
 
@@ -730,9 +764,12 @@ WaitForParallelWorkersToAttach(ParallelContext *pcxt)
 				mq = shm_mq_get_queue(pcxt->worker[i].error_mqh);
 				if (shm_mq_get_sender(mq) != NULL)
 				{
-					/* Yes, so it is known to be attached. */
-					pcxt->known_attached_workers[i] = true;
-					++pcxt->nknown_attached_workers;
+					if (!wait_for_snapshot || *(pcxt->worker[i].snapshot_restored))
+					{
+						/* Yes, so it is known to be attached. */
+						pcxt->known_attached_workers[i] = true;
+						++pcxt->nknown_attached_workers;
+					}
 				}
 			}
 			else if (status == BGWH_STOPPED)
@@ -1291,6 +1328,7 @@ ParallelWorkerMain(Datum main_arg)
 	shm_toc    *toc;
 	FixedParallelState *fps;
 	char	   *error_queue_space;
+	bool	   *snapshot_restored_space;
 	shm_mq	   *mq;
 	shm_mq_handle *mqh;
 	char	   *libraryspace;
@@ -1489,6 +1527,10 @@ ParallelWorkerMain(Datum main_arg)
 							   fps->parallel_leader_pgproc);
 	PushActiveSnapshot(asnapshot);
 
+	/* Snapshot is restored; set the flag so the leader knows about it. */
+	snapshot_restored_space = shm_toc_lookup(toc, PARALLEL_KEY_SNAPSHOT_RESTORED, false);
+	snapshot_restored_space[ParallelWorkerNumber] = true;
+
 	/*
 	 * We've changed which tuples we can see, and must therefore invalidate
 	 * system caches.
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c5a900f1b29..fcb6e940ff2 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1531,7 +1531,7 @@ index_concurrently_build(Oid heapRelationId,
 
 	/* Invalidate catalog snapshot just for assert */
 	InvalidateCatalogSnapshot();
-	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+	Assert(indexInfo->ii_Unique || !TransactionIdIsValid(MyProc->xmin));
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 7cb12a11c2d..2907b366791 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -262,7 +262,8 @@ ExecSeqScanInitializeDSM(SeqScanState *node,
 	pscan = shm_toc_allocate(pcxt->toc, node->pscan_len);
 	table_parallelscan_initialize(node->ss.ss_currentRelation,
 								  pscan,
-								  estate->es_snapshot);
+								  estate->es_snapshot,
+								  false);
 	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan);
 	node->ss.ss_currentScanDesc =
 		table_beginscan_parallel(node->ss.ss_currentRelation, pscan);
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 101a02c5b60..153ac28db3e 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -283,14 +283,6 @@ GetTransactionSnapshot(void)
 Snapshot
 GetLatestSnapshot(void)
 {
-	/*
-	 * We might be able to relax this, but nothing that could otherwise work
-	 * needs it.
-	 */
-	if (IsInParallelMode())
-		elog(ERROR,
-			 "cannot update SecondarySnapshot during a parallel operation");
-
 	/*
 	 * So far there are no cases requiring support for GetLatestSnapshot()
 	 * during logical decoding, but it wouldn't be hard to add if required.
diff --git a/src/include/access/parallel.h b/src/include/access/parallel.h
index 69ffe5498f9..964a7e945be 100644
--- a/src/include/access/parallel.h
+++ b/src/include/access/parallel.h
@@ -26,6 +26,7 @@ typedef struct ParallelWorkerInfo
 {
 	BackgroundWorkerHandle *bgwhandle;
 	shm_mq_handle *error_mqh;
+	bool		  *snapshot_restored;
 } ParallelWorkerInfo;
 
 typedef struct ParallelContext
@@ -65,7 +66,7 @@ extern void InitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelDSM(ParallelContext *pcxt);
 extern void ReinitializeParallelWorkers(ParallelContext *pcxt, int nworkers_to_launch);
 extern void LaunchParallelWorkers(ParallelContext *pcxt);
-extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt);
+extern void WaitForParallelWorkersToAttach(ParallelContext *pcxt, bool wait_for_snapshot);
 extern void WaitForParallelWorkersToFinish(ParallelContext *pcxt);
 extern void DestroyParallelContext(ParallelContext *pcxt);
 extern bool ParallelContextActive(void);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8ca8f789617..d801aca82a5 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -82,6 +82,7 @@ typedef struct ParallelTableScanDescData
 	RelFileLocator phs_locator; /* physical relation to scan */
 	bool		phs_syncscan;	/* report location to syncscan logic? */
 	bool		phs_snapshot_any;	/* SnapshotAny, not phs_snapshot_data? */
+	bool		phs_reset_snapshot; /* use SO_RESET_SNAPSHOT? */
 	Size		phs_snapshot_off;	/* data for snapshot */
 } ParallelTableScanDescData;
 typedef struct ParallelTableScanDescData *ParallelTableScanDesc;
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index a328f3aea6b..66e1ad83f1a 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -1180,7 +1180,8 @@ extern Size table_parallelscan_estimate(Relation rel, Snapshot snapshot);
  */
 extern void table_parallelscan_initialize(Relation rel,
 										  ParallelTableScanDesc pscan,
-										  Snapshot snapshot);
+										  Snapshot snapshot,
+										  bool reset_snapshot);
 
 /*
  * Begin a parallel scan. `pscan` needs to have been initialized with
@@ -1798,9 +1799,9 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
  *
- * In case of non-unique index and non-parallel concurrent build
- * SO_RESET_SNAPSHOT is applied for the scan. That leads for changing snapshots
- * on the fly to allow xmin horizon propagate.
+ * For a non-unique concurrent index build, SO_RESET_SNAPSHOT is applied to
+ * the scan.  The snapshot is then replaced on the fly during the scan, which
+ * allows the xmin horizon to advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 5db54530f17..595a4000ce0 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -17,6 +17,12 @@ SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice'
  
 (1 row)
 
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -72,24 +78,35 @@ NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+ injection_points_detach 
+-------------------------
+ 
+(1 row)
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
-NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -97,7 +114,9 @@ REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 5072535b355..2941aa7ae38 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -3,7 +3,7 @@ CREATE EXTENSION injection_points;
 SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
-
+SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
@@ -53,6 +53,9 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
 
+-- Detach to keep test stable, since parallel worker may complete scan before leader
+SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
+
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
@@ -83,4 +86,4 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 
 DROP SCHEMA cic_reset_snap CASCADE;
 
-DROP EXTENSION injection_points;
+DROP EXTENSION injection_points;
\ No newline at end of file
-- 
2.43.0
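
Reviewer note on the patch above: the leader/worker handshake added to parallel.c can be modelled in a few lines. This is only a single-process sketch under stated assumptions, not the PostgreSQL code: `ToyContext`, `toy_init`, `toy_worker_attach`, `toy_worker_restore_snapshot`, `toy_poll_attached` and `TOY_MAX_WORKERS` are hypothetical names, plain arrays stand in for the DSM flag array and the error queues, and no real shared memory or synchronization is involved.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 8		/* hypothetical limit for this sketch */

/* Hypothetical stand-in for the leader's view of its workers. */
typedef struct ToyContext
{
	int			nworkers;
	bool		sender_attached[TOY_MAX_WORKERS];	/* worker attached to its error queue */
	bool		snapshot_restored[TOY_MAX_WORKERS];	/* models the per-worker shared flag */
	bool		known_attached[TOY_MAX_WORKERS];
	int			nknown_attached;
} ToyContext;

/* Leader side: all flags start out false, as in InitializeParallelDSM. */
static void
toy_init(ToyContext *cxt, int nworkers)
{
	assert(nworkers >= 0 && nworkers <= TOY_MAX_WORKERS);
	cxt->nworkers = nworkers;
	cxt->nknown_attached = 0;
	for (int i = 0; i < TOY_MAX_WORKERS; i++)
	{
		cxt->sender_attached[i] = false;
		cxt->snapshot_restored[i] = false;
		cxt->known_attached[i] = false;
	}
}

/* Worker side: attach to the error queue first ... */
static void
toy_worker_attach(ToyContext *cxt, int i)
{
	cxt->sender_attached[i] = true;
}

/* ... and report the restored snapshot later. */
static void
toy_worker_restore_snapshot(ToyContext *cxt, int i)
{
	cxt->snapshot_restored[i] = true;
}

/*
 * One polling pass of the leader.  With wait_for_snapshot set, a worker
 * counts as attached only once its flag is set as well.
 */
static int
toy_poll_attached(ToyContext *cxt, bool wait_for_snapshot)
{
	for (int i = 0; i < cxt->nworkers; i++)
	{
		if (cxt->known_attached[i])
			continue;
		if (cxt->sender_attached[i] &&
			(!wait_for_snapshot || cxt->snapshot_restored[i]))
		{
			cxt->known_attached[i] = true;
			cxt->nknown_attached++;
		}
	}
	return cxt->nknown_attached;
}
```

In this model, a worker that has attached but not yet restored its snapshot is not counted while `wait_for_snapshot` is true, which mirrors the condition the patch adds to WaitForParallelWorkersToAttach.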



  [application/octet-stream] v9-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch (36.3K, 6-v9-0003-Allow-advancing-xmin-during-non-unique-non-parall.patch)
From 4ee802bb929b4d401a3c69b879275fde06591866 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 17:41:29 +0100
Subject: [PATCH v9 3/9] Allow advancing xmin during non-unique, non-parallel 
 concurrent index builds by periodically resetting snapshots

Long-running transactions like those used by CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY can hold back the global xmin horizon, preventing VACUUM from cleaning up dead tuples and potentially leading to transaction ID wraparound issues. In PostgreSQL 14, commit d9d076222f5b attempted to allow VACUUM to ignore indexing transactions with CONCURRENTLY to mitigate this problem. However, this was reverted in commit e28bb8851969 because it could cause indexes to miss heap tuples that were HOT-updated and HOT-pruned during the index creation, leading to index corruption.

This patch introduces a safe alternative by periodically resetting the snapshot used during non-unique, non-parallel concurrent index builds. By resetting the snapshot every N pages during the heap scan, we allow the xmin horizon to advance without risking index corruption. This approach is safe for non-unique index builds because they do not enforce uniqueness constraints that require a consistent snapshot across the entire scan.

Currently, this technique is applied to:

- Non-parallel index builds: Parallel index builds are not yet supported and will be addressed in a future commit.
- Non-unique indexes: Unique index builds still require a consistent snapshot to enforce uniqueness constraints, and support for them may be added in the future.
- Only the first scan of the heap: The second scan during index validation still uses a single snapshot to ensure index correctness.

To implement this, a new scan option SO_RESET_SNAPSHOT is introduced. When set, it causes the snapshot to be reset every SO_RESET_SNAPSHOT_EACH_N_PAGE pages during the scan. The heap scan code is adjusted to support this option, and the index build code is modified to use it for applicable concurrent index builds that are not on system catalogs and not using parallel workers.

This addresses the issues that led to the reversion of commit d9d076222f5b, providing a safe way to allow xmin advancement during long-running non-unique, non-parallel concurrent index builds while ensuring index correctness.

Regression tests are added to verify the behavior.
---
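
Reviewer note (not part of the commit message): the reset cadence described above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the heapam.c code: `ToyScan`, `toy_should_reset` and `toy_scan` are hypothetical names, blocks are assumed to be visited in order starting at block 0, and a counter stands in for the actual pop and push of the active snapshot. Only the constant and the modulo condition are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as in the heapam.c hunk of this patch. */
#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64

/* Hypothetical toy scan state; not PostgreSQL's HeapScanDesc. */
typedef struct ToyScan
{
	bool		reset_snapshot; /* models the SO_RESET_SNAPSHOT flag */
	uint32_t	nblocks;		/* number of heap blocks to visit */
	uint32_t	nresets;		/* how many times the snapshot was replaced */
} ToyScan;

/* Mirrors the condition checked after each block is fetched. */
static bool
toy_should_reset(const ToyScan *scan, uint32_t blkno)
{
	assert(scan != NULL);
	return scan->reset_snapshot &&
		(blkno % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0);
}

/* Walk all blocks in order and count snapshot replacements. */
static uint32_t
toy_scan(ToyScan *scan)
{
	scan->nresets = 0;
	for (uint32_t blkno = 0; blkno < scan->nblocks; blkno++)
	{
		/* The patch pops the old snapshot and pushes a fresh one here. */
		if (toy_should_reset(scan, blkno))
			scan->nresets++;
	}
	return scan->nresets;
}
```

For a 200-block table the model replaces the snapshot at blocks 0, 64, 128 and 192, so the backend never holds one snapshot for more than 64 consecutive blocks of the scan.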
 contrib/amcheck/verify_nbtree.c               |   3 +-
 contrib/pgstattuple/pgstattuple.c             |   2 +-
 src/backend/access/brin/brin.c                |  14 +++
 src/backend/access/heap/heapam.c              |  46 ++++++++
 src/backend/access/heap/heapam_handler.c      |  57 ++++++++--
 src/backend/access/index/genam.c              |   2 +-
 src/backend/access/nbtree/nbtsort.c           |  14 +++
 src/backend/catalog/index.c                   |  30 ++++-
 src/backend/commands/indexcmds.c              |  14 +--
 src/backend/optimizer/plan/planner.c          |   9 ++
 src/include/access/tableam.h                  |  28 ++++-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/cic_reset_snapshots.out          | 107 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |   1 +
 .../sql/cic_reset_snapshots.sql               |  86 ++++++++++++++
 15 files changed, 384 insertions(+), 31 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/cic_reset_snapshots.out
 create mode 100644 src/test/modules/injection_points/sql/cic_reset_snapshots.sql

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index ffe4f721672..7fb052ce3de 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -689,7 +689,8 @@ bt_check_every_level(Relation rel, Relation heaprel, bool heapkeyspace,
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 true); /* syncscan OK? */
+									 true, /* syncscan OK? */
+									 false);
 
 		/*
 		 * Scan will behave as the first scan of a CREATE INDEX CONCURRENTLY
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 48cb8f59c4f..ff7cc07df99 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -332,7 +332,7 @@ pgstat_heap(Relation rel, FunctionCallInfo fcinfo)
 				 errmsg("only heap AM is supported")));
 
 	/* Disable syncscan because we assume we scan from block zero upwards */
-	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false);
+	scan = table_beginscan_strat(rel, SnapshotAny, 0, NULL, true, false, false);
 	hscan = (HeapScanDesc) scan;
 
 	InitDirtySnapshot(SnapshotDirty);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 9af445cdcdd..d80394766d5 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2366,6 +2366,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -2391,9 +2392,16 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BRIN_SHARED workspace.
@@ -2436,6 +2444,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -2515,6 +2525,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_brin_end_parallel(brinleader, NULL);
 		return;
 	}
@@ -2531,6 +2543,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 329e727f80d..c2860ebbf32 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -51,6 +51,7 @@
 #include "utils/datum.h"
 #include "utils/inval.h"
 #include "utils/spccache.h"
+#include "utils/injection_point.h"
 
 
 static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
@@ -568,6 +569,36 @@ heap_prepare_pagescan(TableScanDesc sscan)
 	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 }
 
+/*
+ * Reset the active snapshot during a scan.
+ * This ensures the xmin horizon can advance while maintaining safe tuple visibility.
+ * Note: No other snapshot should be active during this operation.
+ */
+static inline void
+heap_reset_scan_snapshot(TableScanDesc sscan)
+{
+	/* Make sure no other snapshot was set as active. */
+	Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+	/* And make sure active snapshot is not registered. */
+	Assert(GetActiveSnapshot()->regd_count == 0);
+	PopActiveSnapshot();
+
+	sscan->rs_snapshot = InvalidSnapshot; /* just to be tidy */
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	InvalidateCatalogSnapshot();
+
+	/* Goal of snapshot reset is to allow horizon to advance. */
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+#ifdef USE_INJECTION_POINTS
+	/* In some cases this is still not possible because an xid is assigned. */
+	if (!TransactionIdIsValid(MyProc->xid))
+		INJECTION_POINT("heap_reset_scan_snapshot_effective");
+#endif
+
+	PushActiveSnapshot(GetLatestSnapshot());
+	sscan->rs_snapshot = GetActiveSnapshot();
+}
+
 /*
  * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM.
  *
@@ -609,7 +640,13 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir)
 
 	scan->rs_cbuf = read_stream_next_buffer(scan->rs_read_stream, NULL);
 	if (BufferIsValid(scan->rs_cbuf))
+	{
 		scan->rs_cblock = BufferGetBlockNumber(scan->rs_cbuf);
+#define SO_RESET_SNAPSHOT_EACH_N_PAGE 64
+		if ((scan->rs_base.rs_flags & SO_RESET_SNAPSHOT) &&
+			(scan->rs_cblock % SO_RESET_SNAPSHOT_EACH_N_PAGE == 0))
+			heap_reset_scan_snapshot((TableScanDesc) scan);
+	}
 }
 
 /*
@@ -1236,6 +1273,15 @@ heap_endscan(TableScanDesc sscan)
 	if (scan->rs_parallelworkerdata != NULL)
 		pfree(scan->rs_parallelworkerdata);
 
+	if (scan->rs_base.rs_flags & SO_RESET_SNAPSHOT)
+	{
+		Assert(!(scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT));
+		/* Make sure no other snapshot was set as active. */
+		Assert(GetActiveSnapshot() == sscan->rs_snapshot);
+		/* And make sure snapshot is not registered. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+	}
+
 	if (scan->rs_base.rs_flags & SO_TEMP_SNAPSHOT)
 		UnregisterSnapshot(scan->rs_base.rs_snapshot);
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 53f572f384b..d9fce07e8ad 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1190,6 +1190,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 	ExprContext *econtext;
 	Snapshot	snapshot;
 	bool		need_unregister_snapshot = false;
+	bool		need_pop_active_snapshot = false;
+	bool		reset_snapshots = false;
 	TransactionId OldestXmin;
 	BlockNumber previous_blkno = InvalidBlockNumber;
 	BlockNumber root_blkno = InvalidBlockNumber;
@@ -1224,9 +1226,6 @@ heapam_index_build_range_scan(Relation heapRelation,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
 	 * Prepare for scan of the base relation.  In a normal index build, we use
 	 * SnapshotAny because we must retrieve all tuples and do our own time
@@ -1236,6 +1235,15 @@ heapam_index_build_range_scan(Relation heapRelation,
 	 */
 	OldestXmin = InvalidTransactionId;
 
+	/*
+	 * A unique index needs a consistent snapshot for the whole scan.
+	 * A parallel scan needs additional infrastructure to run with
+	 * SO_RESET_SNAPSHOT, which is not yet ready.
+	 */
+	reset_snapshots = indexInfo->ii_Concurrent &&
+					  !indexInfo->ii_Unique &&
+					  !is_system_catalog; /* just in case */
+
 	/* okay to ignore lazy VACUUMs here */
 	if (!IsBootstrapProcessingMode() && !indexInfo->ii_Concurrent)
 		OldestXmin = GetOldestNonRemovableTransactionId(heapRelation);
@@ -1244,24 +1252,41 @@ heapam_index_build_range_scan(Relation heapRelation,
 	{
 		/*
 		 * Serial index build.
-		 *
-		 * Must begin our own heap scan in this case.  We may also need to
-		 * register a snapshot whose lifetime is under our direct control.
 		 */
 		if (!TransactionIdIsValid(OldestXmin))
 		{
-			snapshot = RegisterSnapshot(GetTransactionSnapshot());
-			need_unregister_snapshot = true;
+			snapshot = GetTransactionSnapshot();
+			/*
+			 * Must begin our own heap scan in this case.  We may also need to
+			 * register a snapshot whose lifetime is under our direct control.
+			 * When the snapshot is reset during the scan, it must not be
+			 * registered, because it is going to be replaced every so
+			 * often.
+			 */
+			if (!reset_snapshots)
+			{
+				snapshot = RegisterSnapshot(snapshot);
+				need_unregister_snapshot = true;
+			}
+			Assert(!ActiveSnapshotSet());
+			PushActiveSnapshot(snapshot);
+			/* keep a pointer to the active snapshot, which may be a copy */
+			snapshot = GetActiveSnapshot();
+			need_pop_active_snapshot = true;
 		}
 		else
+		{
+			Assert(!indexInfo->ii_Concurrent);
 			snapshot = SnapshotAny;
+		}
 
 		scan = table_beginscan_strat(heapRelation,	/* relation */
 									 snapshot,	/* snapshot */
 									 0, /* number of keys */
 									 NULL,	/* scan key */
 									 true,	/* buffer access strategy OK */
-									 allow_sync);	/* syncscan OK? */
+									 allow_sync,	/* syncscan OK? */
+									 reset_snapshots /* reset snapshots? */);
 	}
 	else
 	{
@@ -1275,6 +1300,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 		Assert(!IsBootstrapProcessingMode());
 		Assert(allow_sync);
 		snapshot = scan->rs_snapshot;
+		PushActiveSnapshot(snapshot);
+		need_pop_active_snapshot = true;
 	}
 
 	hscan = (HeapScanDesc) scan;
@@ -1289,6 +1316,13 @@ heapam_index_build_range_scan(Relation heapRelation,
 	Assert(snapshot == SnapshotAny ? TransactionIdIsValid(OldestXmin) :
 		   !TransactionIdIsValid(OldestXmin));
 	Assert(snapshot == SnapshotAny || !anyvisible);
+	Assert(snapshot == SnapshotAny || ActiveSnapshotSet());
+
+	/* Set up execution state for predicate, if any. */
+	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	/* Clear reference to snapshot since it may be changed by the scan itself. */
+	if (reset_snapshots)
+		snapshot = InvalidSnapshot;
 
 	/* Publish number of blocks to scan */
 	if (progress)
@@ -1724,6 +1758,8 @@ heapam_index_build_range_scan(Relation heapRelation,
 
 	table_endscan(scan);
 
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	/* we can now forget our snapshot, if set and registered by us */
 	if (need_unregister_snapshot)
 		UnregisterSnapshot(snapshot);
@@ -1796,7 +1832,8 @@ heapam_index_validate_scan(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 false);	/* syncscan not OK */
+								 false,	/* syncscan not OK */
+								 false);
 	hscan = (HeapScanDesc) scan;
 
 	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 4b4ebff6a17..a104ba9df74 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -463,7 +463,7 @@ systable_beginscan(Relation heapRelation,
 		 */
 		sysscan->scan = table_beginscan_strat(heapRelation, snapshot,
 											  nkeys, key,
-											  true, false);
+											  true, false, false);
 		sysscan->iscan = NULL;
 	}
 
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 28522c0ac1c..8647422ed05 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1410,6 +1410,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	bool		leaderparticipates = true;
+	bool		need_pop_active_snapshot = true;
 	int			querylen;
 
 #ifdef DISABLE_LEADER_PARTICIPATION
@@ -1435,9 +1436,16 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * live according to that.
 	 */
 	if (!isconcurrent)
+	{
+		Assert(ActiveSnapshotSet());
 		snapshot = SnapshotAny;
+		need_pop_active_snapshot = false;
+	}
 	else
+	{
 		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+		PushActiveSnapshot(snapshot);
+	}
 
 	/*
 	 * Estimate size for our own PARALLEL_KEY_BTREE_SHARED workspace, and
@@ -1491,6 +1499,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no DSM segment was available, back out (do serial build) */
 	if (pcxt->seg == NULL)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		if (IsMVCCSnapshot(snapshot))
 			UnregisterSnapshot(snapshot);
 		DestroyParallelContext(pcxt);
@@ -1585,6 +1595,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	/* If no workers were successfully launched, back out (do serial build) */
 	if (pcxt->nworkers_launched == 0)
 	{
+		if (need_pop_active_snapshot)
+			PopActiveSnapshot();
 		_bt_end_parallel(btleader);
 		return;
 	}
@@ -1601,6 +1613,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 6976249e9e9..c5a900f1b29 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -79,6 +79,7 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tuplesort.h"
+#include "storage/proc.h"
 
 /* Potentially set by pg_upgrade_support functions */
 Oid			binary_upgrade_next_index_pg_class_oid = InvalidOid;
@@ -1491,8 +1492,8 @@ index_concurrently_build(Oid heapRelationId,
 	Relation	indexRelation;
 	IndexInfo  *indexInfo;
 
-	/* This had better make sure that a snapshot is active */
-	Assert(ActiveSnapshotSet());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	Assert(!TransactionIdIsValid(MyProc->xid));
 
 	/* Open and lock the parent heap relation */
 	heapRel = table_open(heapRelationId, ShareUpdateExclusiveLock);
@@ -1510,19 +1511,28 @@ index_concurrently_build(Oid heapRelationId,
 
 	indexRelation = index_open(indexRelationId, RowExclusiveLock);
 
+	/* BuildIndexInfo may require a snapshot for expressions and predicates */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * We have to re-build the IndexInfo struct, since it was lost in the
 	 * commit of the transaction where this concurrent index was created at
 	 * the catalog level.
 	 */
 	indexInfo = BuildIndexInfo(indexRelation);
+	/* Done with snapshot */
+	PopActiveSnapshot();
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
 	index_build(heapRel, indexRelation, indexInfo, false, true);
 
+	/* Invalidate catalog snapshot just for assert */
+	InvalidateCatalogSnapshot();
+	Assert((indexInfo->ii_ParallelWorkers || indexInfo->ii_Unique) || !TransactionIdIsValid(MyProc->xmin));
+
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
 
@@ -1533,12 +1543,19 @@ index_concurrently_build(Oid heapRelationId,
 	table_close(heapRel, NoLock);
 	index_close(indexRelation, NoLock);
 
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
 	 * Update the pg_index row to mark the index as ready for inserts. Once we
 	 * commit this transaction, any new transactions that open the table must
 	 * insert new entries into the index for insertions and non-HOT updates.
 	 */
 	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
+	/* we can do away with our snapshot */
+	PopActiveSnapshot();
 }
 
 /*
@@ -3206,7 +3223,8 @@ IndexCheckExclusion(Relation heapRelation,
 								 0, /* number of keys */
 								 NULL,	/* scan key */
 								 true,	/* buffer access strategy OK */
-								 true); /* syncscan OK */
+								 true, /* syncscan OK */
+								 false);
 
 	while (table_scan_getnextslot(scan, ForwardScanDirection, slot))
 	{
@@ -3269,12 +3287,16 @@ IndexCheckExclusion(Relation heapRelation,
  * as of the start of the scan (see table_index_build_scan), whereas a normal
  * build takes care to include recently-dead tuples.  This is OK because
  * we won't mark the index valid until all transactions that might be able
- * to see those tuples are gone.  The reason for doing that is to avoid
+ * to see those tuples are gone.  One of the reasons for doing that is to avoid
  * bogus unique-index failures due to concurrent UPDATEs (we might see
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
  *
+ * Furthermore, for a non-unique index we set SO_RESET_SNAPSHOT for the scan,
+ * which causes a new snapshot to be set as active every so often.  The reason
+ * for that is to let the xmin horizon advance.
+ *
  * Next, we mark the index "indisready" (but still not "indisvalid") and
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 932854d6c60..6c1fce8ed25 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1670,23 +1670,17 @@ DefineIndex(Oid tableId,
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We now take a new snapshot, and build the index using all tuples that
-	 * are visible in this snapshot.  We can be sure that any HOT updates to
+	 * We build the index using all tuples that are visible to a single snapshot
+	 * or to a series of refreshed snapshots.  We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
 	 * rolled back.  Thus, each visible tuple is either the end of its
 	 * HOT-chain or the extension of the chain is HOT-safe for this index.
 	 */
 
-	/* Set ActiveSnapshot since functions in the indexes may need it */
-	PushActiveSnapshot(GetTransactionSnapshot());
-
 	/* Perform concurrent build of index */
 	index_concurrently_build(tableId, indexRelationId);
 
-	/* we can do away with our snapshot */
-	PopActiveSnapshot();
-
 	/*
 	 * Commit this transaction to make the indisready update visible.
 	 */
@@ -4084,9 +4078,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/* Set ActiveSnapshot since functions in the indexes may need it */
-		PushActiveSnapshot(GetTransactionSnapshot());
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4101,7 +4092,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/* Perform concurrent build of new index */
 		index_concurrently_build(newidx->tableId, newidx->indexId);
 
-		PopActiveSnapshot();
 		CommitTransactionCommand();
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 7468961b017..1ef6c7216f4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -61,6 +61,7 @@
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 #include "utils/selfuncs.h"
+#include "utils/snapmgr.h"
 
 /* GUC parameters */
 double		cursor_tuple_fraction = DEFAULT_CURSOR_TUPLE_FRACTION;
@@ -6778,6 +6779,7 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	Relation	heap;
 	Relation	index;
 	RelOptInfo *rel;
+	bool		need_pop_active_snapshot = false;
 	int			parallel_workers;
 	BlockNumber heap_blocks;
 	double		reltuples;
@@ -6833,6 +6835,11 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 	heap = table_open(tableOid, NoLock);
 	index = index_open(indexOid, NoLock);
 
+	/* Set ActiveSnapshot since functions in the indexes may need it */
+	if (!ActiveSnapshotSet()) {
+		PushActiveSnapshot(GetTransactionSnapshot());
+		need_pop_active_snapshot = true;
+	}
 	/*
 	 * Determine if it's safe to proceed.
 	 *
@@ -6890,6 +6897,8 @@ plan_create_index_workers(Oid tableOid, Oid indexOid)
 		parallel_workers--;
 
 done:
+	if (need_pop_active_snapshot)
+		PopActiveSnapshot();
 	index_close(index, NoLock);
 	table_close(heap, NoLock);
 
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index bb32de11ea0..a328f3aea6b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -24,6 +24,7 @@
 #include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
+#include "utils/injection_point.h"
 
 
 #define DEFAULT_TABLE_ACCESS_METHOD	"heap"
@@ -69,6 +70,17 @@ typedef enum ScanOptions
 	 * needed. If table data may be needed, set SO_NEED_TUPLES.
 	 */
 	SO_NEED_TUPLES = 1 << 10,
+	/*
+	 * Reset scan and catalog snapshot every so often? If so, every
+	 * SO_RESET_SNAPSHOT_EACH_N_PAGE pages the active snapshot is popped, the
+	 * catalog snapshot is invalidated, and the latest snapshot is pushed as active.
+	 *
+	 * At the end of the scan the snapshot is not popped.
+	 * The goal of this mode is to keep the xmin horizon moving forward.
+	 *
+	 * See heap_reset_scan_snapshot for details.
+	 */
+	SO_RESET_SNAPSHOT = 1 << 11,
 }			ScanOptions;
 
 /*
@@ -935,7 +947,8 @@ extern TableScanDesc table_beginscan_catalog(Relation relation, int nkeys,
 static inline TableScanDesc
 table_beginscan_strat(Relation rel, Snapshot snapshot,
 					  int nkeys, struct ScanKeyData *key,
-					  bool allow_strat, bool allow_sync)
+					  bool allow_strat, bool allow_sync,
+					  bool reset_snapshot)
 {
 	uint32		flags = SO_TYPE_SEQSCAN | SO_ALLOW_PAGEMODE;
 
@@ -943,6 +956,15 @@ table_beginscan_strat(Relation rel, Snapshot snapshot,
 		flags |= SO_ALLOW_STRAT;
 	if (allow_sync)
 		flags |= SO_ALLOW_SYNC;
+	if (reset_snapshot)
+	{
+		INJECTION_POINT("table_beginscan_strat_reset_snapshots");
+		/* Active snapshot is required on start. */
+		Assert(GetActiveSnapshot() == snapshot);
+		/* Active snapshot must not be registered, so that xmin can advance. */
+		Assert(GetActiveSnapshot()->regd_count == 0);
+		flags |= (SO_RESET_SNAPSHOT);
+	}
 
 	return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
 }
@@ -1775,6 +1797,10 @@ table_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
  * very hard to detect whether they're really incompatible with the chain tip.
  * This only really makes sense for heap AM, it might need to be generalized
  * for other AMs later.
+ *
+ * For a non-parallel concurrent build of a non-unique index,
+ * SO_RESET_SNAPSHOT is applied to the scan.  This changes snapshots on the
+ * fly so that the xmin horizon can advance.
  */
 static inline double
 table_index_build_scan(Relation table_rel,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index f8f86e8f3b6..73893d351bb 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc
+REGRESS = injection_points reindex_conc cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
new file mode 100644
index 00000000000..5db54530f17
--- /dev/null
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -0,0 +1,107 @@
+CREATE EXTENSION injection_points;
+SELECT injection_points_set_local();
+ injection_points_set_local 
+----------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+NOTICE:  notice triggered for injection point table_parallelscan_initialize
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP SCHEMA cic_reset_snap CASCADE;
+NOTICE:  drop cascades to 3 other objects
+DETAIL:  drop cascades to table cic_reset_snap.tbl
+drop cascades to function cic_reset_snap.predicate_stable(integer)
+drop cascades to function cic_reset_snap.predicate_stable_no_param()
+DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 91fc8ce687f..f288633da4f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -35,6 +35,7 @@ tests += {
     'sql': [
       'injection_points',
       'reindex_conc',
+      'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
     # The injection points are cluster-wide, so disable installcheck
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
new file mode 100644
index 00000000000..5072535b355
--- /dev/null
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -0,0 +1,86 @@
+CREATE EXTENSION injection_points;
+
+SELECT injection_points_set_local();
+SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
+SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
+
+
+CREATE SCHEMA cic_reset_snap;
+CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
+INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
+
+CREATE FUNCTION cic_reset_snap.predicate_stable(integer) RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN MOD($1, 2) = 0;
+END; $$;
+
+CREATE FUNCTION cic_reset_snap.predicate_stable_no_param() RETURNS bool IMMUTABLE
+									  LANGUAGE plpgsql AS $$
+BEGIN
+    EXECUTE 'SELECT txid_current()';
+    RETURN false;
+END; $$;
+
+----------------
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+-- The same in parallel mode
+ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
+
+CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
+REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
+DROP INDEX CONCURRENTLY cic_reset_snap.idx;
+
+DROP SCHEMA cic_reset_snap CASCADE;
+
+DROP EXTENSION injection_points;
-- 
2.43.0



  [application/octet-stream] v9-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch (61.5K, 7-v9-0001-this-is-https-commitfest.postgresql.org-50-5160-m.patch)
From d694020bb8c9b8fa6e346029bba2500c0a0f06cc Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 30 Nov 2024 11:36:28 +0100
Subject: [PATCH v9 1/9] this is https://commitfest.postgresql.org/50/5160/
 merged in single commit. it is required for stability of stress tests.

---
 src/backend/commands/indexcmds.c              |   4 +-
 src/backend/executor/execIndexing.c           |   3 +
 src/backend/executor/execPartition.c          | 119 ++++++++-
 src/backend/executor/nodeModifyTable.c        |   2 +
 src/backend/optimizer/util/plancat.c          | 135 +++++++---
 src/backend/utils/time/snapmgr.c              |   2 +
 src/test/modules/injection_points/Makefile    |   7 +-
 .../expected/index_concurrently_upsert.out    |  80 ++++++
 .../index_concurrently_upsert_predicate.out   |  80 ++++++
 .../expected/reindex_concurrently_upsert.out  | 238 ++++++++++++++++++
 ...ndex_concurrently_upsert_on_constraint.out | 238 ++++++++++++++++++
 ...eindex_concurrently_upsert_partitioned.out | 238 ++++++++++++++++++
 src/test/modules/injection_points/meson.build |  11 +
 .../specs/index_concurrently_upsert.spec      |  68 +++++
 .../index_concurrently_upsert_predicate.spec  |  70 ++++++
 .../specs/reindex_concurrently_upsert.spec    |  86 +++++++
 ...dex_concurrently_upsert_on_constraint.spec |  86 +++++++
 ...index_concurrently_upsert_partitioned.spec |  88 +++++++
 18 files changed, 1505 insertions(+), 50 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
 create mode 100644 src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
 create mode 100644 src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4049ce1a10f..932854d6c60 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1766,6 +1766,7 @@ DefineIndex(Oid tableId,
 	 * before the reference snap was taken, we have to wait out any
 	 * transactions that might have older snapshots.
 	 */
+	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForOlderSnapshots(limitXmin, true);
@@ -4206,7 +4207,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * the same time to make sure we only get constraint violations from the
 	 * indexes with the correct names.
 	 */
-
+	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
 	/*
@@ -4285,6 +4286,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * index_drop() for more details.
 	 */
 
+	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index f0a5f8879a9..820749239ca 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -117,6 +117,7 @@
 #include "utils/multirangetypes.h"
 #include "utils/rangetypes.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 /* waitMode argument to check_exclusion_or_unique_constraint() */
 typedef enum
@@ -936,6 +937,8 @@ retry:
 	econtext->ecxt_scantuple = save_scantuple;
 
 	ExecDropSingleTupleTableSlot(existing_slot);
+	if (!conflict)
+		INJECTION_POINT("check_exclusion_or_unique_constraint_no_conflict");
 
 	return !conflict;
 }
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 76518862291..aeeee41d5f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -483,6 +483,48 @@ ExecFindPartition(ModifyTableState *mtstate,
 	return rri;
 }
 
+/*
+ * IsIndexCompatibleAsArbiter
+ * 		Checks if the indexes are identical in terms of being used
+ * 		as arbiters for the INSERT ON CONFLICT operation by comparing
+ * 		them to the provided arbiter index.
+ *
+ * Returns true if the indexes are compatible.
+ */
+static bool
+IsIndexCompatibleAsArbiter(Relation	arbiterIndexRelation,
+						   IndexInfo  *arbiterIndexInfo,
+						   Relation	indexRelation,
+						   IndexInfo  *indexInfo)
+{
+	int i;
+
+	if (arbiterIndexInfo->ii_Unique != indexInfo->ii_Unique)
+		return false;
+	/* Exclusion constraints are not supported here. */
+	if (arbiterIndexInfo->ii_ExclusionOps != NULL || indexInfo->ii_ExclusionOps != NULL)
+		return false;
+	if (arbiterIndexRelation->rd_index->indnkeyatts != indexRelation->rd_index->indnkeyatts)
+		return false;
+
+	for (i = 0; i < indexRelation->rd_index->indnkeyatts; i++)
+	{
+		int			arbiterAttoNo = arbiterIndexRelation->rd_index->indkey.values[i];
+		int			attoNo = indexRelation->rd_index->indkey.values[i];
+		if (arbiterAttoNo != attoNo)
+			return false;
+	}
+
+	if (list_difference(RelationGetIndexExpressions(arbiterIndexRelation),
+						RelationGetIndexExpressions(indexRelation)) != NIL)
+		return false;
+
+	if (list_difference(RelationGetIndexPredicate(arbiterIndexRelation),
+						RelationGetIndexPredicate(indexRelation)) != NIL)
+		return false;
+	return true;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Lock the partition and initialize ResultRelInfo.  Also setup other
@@ -693,6 +735,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 		if (rootResultRelInfo->ri_onConflictArbiterIndexes != NIL)
 		{
 			List	   *childIdxs;
+			List 	   *nonAncestorIdxs = NIL;
+			int		   i, j, additional_arbiters = 0;
 
 			childIdxs = RelationGetIndexList(leaf_part_rri->ri_RelationDesc);
 
@@ -703,23 +747,74 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+				if (ancestors)
 				{
-					if (list_member_oid(ancestors, lfirst_oid(lc2)))
-						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
+					{
+						if (list_member_oid(ancestors, lfirst_oid(lc2)))
+							arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
+					}
 				}
+				else /* No ancestor was found for that index. Save it for rechecking later. */
+					nonAncestorIdxs = lappend_oid(nonAncestorIdxs, childIdx);
 				list_free(ancestors);
 			}
+
+			/*
+			 * If any non-ancestor indexes are found, we need to compare them with other
+			 * indexes of the relation that will be used as arbiters. This is necessary
+			 * when a partitioned index is processed by REINDEX CONCURRENTLY. Both indexes
+			 * must be considered as arbiters to ensure that all concurrent transactions
+			 * use the same set of arbiters.
+			 */
+			if (nonAncestorIdxs)
+			{
+				for (i = 0; i < leaf_part_rri->ri_NumIndices; i++)
+				{
+					if (list_member_oid(nonAncestorIdxs, leaf_part_rri->ri_IndexRelationDescs[i]->rd_index->indexrelid))
+					{
+						Relation nonAncestorIndexRelation = leaf_part_rri->ri_IndexRelationDescs[i];
+						IndexInfo *nonAncestorIndexInfo = leaf_part_rri->ri_IndexRelationInfo[i];
+						Assert(!list_member_oid(arbiterIndexes, nonAncestorIndexRelation->rd_index->indexrelid));
+
+						/* It is too early to use non-ready indexes as arbiters */
+						if (!nonAncestorIndexInfo->ii_ReadyForInserts)
+							continue;
+
+						for (j = 0; j < leaf_part_rri->ri_NumIndices; j++)
+						{
+							if (list_member_oid(arbiterIndexes,
+												leaf_part_rri->ri_IndexRelationDescs[j]->rd_index->indexrelid))
+							{
+								Relation arbiterIndexRelation = leaf_part_rri->ri_IndexRelationDescs[j];
+								IndexInfo *arbiterIndexInfo = leaf_part_rri->ri_IndexRelationInfo[j];
+
+								/* If the non-ancestor index is compatible with the arbiter, use it as an arbiter too. */
+								if (IsIndexCompatibleAsArbiter(arbiterIndexRelation, arbiterIndexInfo,
+															   nonAncestorIndexRelation, nonAncestorIndexInfo))
+								{
+									arbiterIndexes = lappend_oid(arbiterIndexes,
+																 nonAncestorIndexRelation->rd_index->indexrelid);
+									additional_arbiters++;
+								}
+							}
+						}
+					}
+				}
+			}
+			list_free(nonAncestorIdxs);
+
+			/*
+			 * If the resulting lists are of inequal length, something is wrong.
+			 * (This shouldn't happen, since arbiter index selection should not
+			 * pick up a non-ready index.)
+			 *
+			 * But we need to account for the additional arbiter indexes as well.
+			 */
+			if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
+				list_length(arbiterIndexes) - additional_arbiters)
+				elog(ERROR, "invalid arbiter index list");
 		}
-
-		/*
-		 * If the resulting lists are of inequal length, something is wrong.
-		 * (This shouldn't happen, since arbiter index selection should not
-		 * pick up an invalid index.)
-		 */
-		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
-			list_length(arbiterIndexes))
-			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
 
 		/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index c445c433df4..67befb6cba6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -69,6 +69,7 @@
 #include "utils/datum.h"
 #include "utils/rel.h"
 #include "utils/snapmgr.h"
+#include "utils/injection_point.h"
 
 
 typedef struct MTTargetRelLookup
@@ -1087,6 +1088,7 @@ ExecInsert(ModifyTableContext *context,
 					return NULL;
 				}
 			}
+			INJECTION_POINT("exec_insert_before_insert_speculative");
 
 			/*
 			 * Before we start insertion proper, acquire our "speculative
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c31cc3ee69f..b4f9641e588 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -714,12 +714,14 @@ infer_arbiter_indexes(PlannerInfo *root)
 	List	   *indexList;
 	ListCell   *l;
 
-	/* Normalized inference attributes and inference expressions: */
-	Bitmapset  *inferAttrs = NULL;
-	List	   *inferElems = NIL;
+	/* Normalized required attributes and expressions: */
+	Bitmapset  *requiredArbiterAttrs = NULL;
+	List	   *requiredArbiterElems = NIL;
+	List	   *requiredIndexPredExprs = (List *) onconflict->arbiterWhere;
 
 	/* Results */
 	List	   *results = NIL;
+	bool	   foundValid = false;
 
 	/*
 	 * Quickly return NIL for ON CONFLICT DO NOTHING without an inference
@@ -754,8 +756,8 @@ infer_arbiter_indexes(PlannerInfo *root)
 
 		if (!IsA(elem->expr, Var))
 		{
-			/* If not a plain Var, just shove it in inferElems for now */
-			inferElems = lappend(inferElems, elem->expr);
+			/* If not a plain Var, just shove it in requiredArbiterElems for now */
+			requiredArbiterElems = lappend(requiredArbiterElems, elem->expr);
 			continue;
 		}
 
@@ -767,30 +769,76 @@ infer_arbiter_indexes(PlannerInfo *root)
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 					 errmsg("whole row unique index inference specifications are not supported")));
 
-		inferAttrs = bms_add_member(inferAttrs,
+		requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
 									attno - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	indexList = RelationGetIndexList(relation);
+
 	/*
 	 * Lookup named constraint's index.  This is not immediately returned
-	 * because some additional sanity checks are required.
+	 * because some additional sanity checks are required. Additionally, we
+	 * need to process other indexes as potential arbiters to account for
+	 * cases where REINDEX CONCURRENTLY is processing an index used as a
+	 * named constraint.
 	 */
 	if (onconflict->constraint != InvalidOid)
 	{
 		indexOidFromConstraint = get_constraint_index(onconflict->constraint);
 
 		if (indexOidFromConstraint == InvalidOid)
+		{
 			ereport(ERROR,
 					(errcode(ERRCODE_WRONG_OBJECT_TYPE),
-					 errmsg("constraint in ON CONFLICT clause has no associated index")));
+					errmsg("constraint in ON CONFLICT clause has no associated index")));
+		}
+
+		/*
+		 * Find the named constraint index to extract its attributes and predicates.
+		 * We open all indexes in the loop to avoid a deadlock caused by a changed lock order.
+		 */
+		foreach(l, indexList)
+		{
+			Oid			indexoid = lfirst_oid(l);
+			Relation	idxRel;
+			Form_pg_index idxForm;
+			AttrNumber	natt;
+
+			idxRel = index_open(indexoid, rte->rellockmode);
+			idxForm = idxRel->rd_index;
+
+			if (idxForm->indisready)
+			{
+				if (indexOidFromConstraint == idxForm->indexrelid)
+				{
+					/*
+					 * Prepare the requirements other indexes must meet to be used as arbiters
+					 * together with indexOidFromConstraint. Both equivalent indexes must be
+					 * involved in the case of REINDEX CONCURRENTLY.
+					 */
+					for (natt = 0; natt < idxForm->indnkeyatts; natt++)
+					{
+						int			attno = idxRel->rd_index->indkey.values[natt];
+
+						if (attno != 0)
+							requiredArbiterAttrs = bms_add_member(requiredArbiterAttrs,
+														  attno - FirstLowInvalidHeapAttributeNumber);
+					}
+					requiredArbiterElems = RelationGetIndexExpressions(idxRel);
+					requiredIndexPredExprs = RelationGetIndexPredicate(idxRel);
+					/* We are done, so quit the loop. */
+					index_close(idxRel, NoLock);
+					break;
+				}
+			}
+			index_close(idxRel, NoLock);
+		}
 	}
 
 	/*
 	 * Using that representation, iterate through the list of indexes on the
 	 * target relation to try and find a match
 	 */
-	indexList = RelationGetIndexList(relation);
-
 	foreach(l, indexList)
 	{
 		Oid			indexoid = lfirst_oid(l);
@@ -813,7 +861,13 @@ infer_arbiter_indexes(PlannerInfo *root)
 		idxRel = index_open(indexoid, rte->rellockmode);
 		idxForm = idxRel->rd_index;
 
-		if (!idxForm->indisvalid)
+		/*
+		 * We need to consider both indisvalid and indisready indexes because
+		 * the latter may become indisvalid before the execution phase. This is
+		 * required to keep the set of indexes used as arbiters the same for
+		 * all concurrent transactions.
+		 */
+		if (!idxForm->indisready)
 			goto next;
 
 		/*
@@ -833,27 +887,23 @@ infer_arbiter_indexes(PlannerInfo *root)
 				ereport(ERROR,
 						(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 						 errmsg("ON CONFLICT DO UPDATE not supported with exclusion constraints")));
-
-			results = lappend_oid(results, idxForm->indexrelid);
-			list_free(indexList);
-			index_close(idxRel, NoLock);
-			table_close(relation, NoLock);
-			return results;
+			goto found;
 		}
 		else if (indexOidFromConstraint != InvalidOid)
 		{
-			/* No point in further work for index in named constraint case */
-			goto next;
+			/* In the case of "ON CONFLICT ON CONSTRAINT constraint_name DO UPDATE" we need to skip non-unique candidates. */
+			if (!idxForm->indisunique && onconflict->action == ONCONFLICT_UPDATE)
+				goto next;
+		}  else {
+			/*
+			 * Only considering conventional inference at this point (not named
+			 * constraints), so index under consideration can be immediately
+			 * skipped if it's not unique
+			 */
+			if (!idxForm->indisunique)
+				goto next;
 		}
 
-		/*
-		 * Only considering conventional inference at this point (not named
-		 * constraints), so index under consideration can be immediately
-		 * skipped if it's not unique
-		 */
-		if (!idxForm->indisunique)
-			goto next;
-
 		/*
 		 * So-called unique constraints with WITHOUT OVERLAPS are really
 		 * exclusion constraints, so skip those too.
@@ -873,7 +923,7 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/* Non-expression attributes (if any) must match */
-		if (!bms_equal(indexedAttrs, inferAttrs))
+		if (!bms_equal(indexedAttrs, requiredArbiterAttrs))
 			goto next;
 
 		/* Expression attributes (if any) must match */
@@ -881,6 +931,10 @@ infer_arbiter_indexes(PlannerInfo *root)
 		if (idxExprs && varno != 1)
 			ChangeVarNodes((Node *) idxExprs, 1, varno, 0);
 
+		/*
+		 * If arbiterElems are present, check them. If a named constraint is
+		 * present, arbiterElems == NIL.
+		 */
 		foreach(el, onconflict->arbiterElems)
 		{
 			InferenceElem *elem = (InferenceElem *) lfirst(el);
@@ -918,27 +972,35 @@ infer_arbiter_indexes(PlannerInfo *root)
 		}
 
 		/*
-		 * Now that all inference elements were matched, ensure that the
+		 * In the case of conventional inference, ensure that the
 		 * expression elements from inference clause are not missing any
 		 * cataloged expressions.  This does the right thing when unique
 		 * indexes redundantly repeat the same attribute, or if attributes
 		 * redundantly appear multiple times within an inference clause.
+		 *
+		 * In the case of a named constraint, ensure the candidate has the same
+		 * set of expressions as the named constraint's index.
 		 */
-		if (list_difference(idxExprs, inferElems) != NIL)
+		if (list_difference(idxExprs, requiredArbiterElems) != NIL)
 			goto next;
 
-		/*
-		 * If it's a partial index, its predicate must be implied by the ON
-		 * CONFLICT's WHERE clause.
-		 */
 		predExprs = RelationGetIndexPredicate(idxRel);
 		if (predExprs && varno != 1)
 			ChangeVarNodes((Node *) predExprs, 1, varno, 0);
 
-		if (!predicate_implied_by(predExprs, (List *) onconflict->arbiterWhere, false))
+		/*
+		 * If it's a partial index and conventional inference is used, its predicate
+		 * must be implied by the ON CONFLICT's WHERE clause.
+		 */
+		if (indexOidFromConstraint == InvalidOid && !predicate_implied_by(predExprs, requiredIndexPredExprs, false))
+			goto next;
+		/* If it's a partial index and a named constraint is used, the predicates must be equal. */
+		if (indexOidFromConstraint != InvalidOid && list_difference(predExprs, requiredIndexPredExprs) != NIL)
 			goto next;
 
+found:
 		results = lappend_oid(results, idxForm->indexrelid);
+		foundValid |= idxForm->indisvalid;
 next:
 		index_close(idxRel, NoLock);
 	}
@@ -946,7 +1008,8 @@ next:
 	list_free(indexList);
 	table_close(relation, NoLock);
 
-	if (results == NIL)
+	/* At least one indisvalid index is required during planning. */
+	if (results == NIL || !foundValid)
 		ereport(ERROR,
 				(errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
 				 errmsg("there is no unique or exclusion constraint matching the ON CONFLICT specification")));
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6eb29b99735..101a02c5b60 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -64,6 +64,7 @@
 #include "utils/resowner.h"
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
+#include "utils/injection_point.h"
 
 
 /*
@@ -388,6 +389,7 @@ InvalidateCatalogSnapshot(void)
 		pairingheap_remove(&RegisteredSnapshots, &CatalogSnapshot->ph_node);
 		CatalogSnapshot = NULL;
 		SnapshotResetXmin();
+		INJECTION_POINT("invalidate_catalog_snapshot_end");
 	}
 }
 
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 0753a9df58c..f8f86e8f3b6 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -13,7 +13,12 @@ PGFILEDESC = "injection_points - facility for injection points"
 REGRESS = injection_points reindex_conc
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
-ISOLATION = basic inplace
+ISOLATION = basic inplace \
+			reindex_concurrently_upsert \
+			index_concurrently_upsert \
+			reindex_concurrently_upsert_partitioned \
+			reindex_concurrently_upsert_on_constraint \
+			index_concurrently_upsert_predicate
 
 TAP_TESTS = 1
 
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert.out b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
new file mode 100644
index 00000000000..7f0659e8369
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
new file mode 100644
index 00000000000..2300d5165e9
--- /dev/null
+++ b/src/test/modules/injection_points/expected/index_concurrently_upsert_predicate.out
@@ -0,0 +1,80 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_create_index s1_start_upsert s4_wakeup_define_index_before_set_valid s2_start_upsert s4_wakeup_s1_from_invalidate_catalog_snapshot s4_wakeup_s2 s4_wakeup_s1
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_define_index_before_set_valid: 
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_create_index: <... completed>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1_from_invalidate_catalog_snapshot: 
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
new file mode 100644
index 00000000000..24bbbcbdd88
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
new file mode 100644
index 00000000000..d1cfd1731c8
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_on_constraint.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
new file mode 100644
index 00000000000..c95ff264f12
--- /dev/null
+++ b/src/test/modules/injection_points/expected/reindex_concurrently_upsert_partitioned.out
@@ -0,0 +1,238 @@
+Parsed test spec with 4 sessions
+
+starting permutation: s3_start_reindex s1_start_upsert s4_wakeup_to_swap s2_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s2_start_upsert s4_wakeup_to_swap s1_start_upsert s4_wakeup_s1 s4_wakeup_s2 s4_wakeup_to_set_dead
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s2_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+
+starting permutation: s3_start_reindex s4_wakeup_to_swap s1_start_upsert s2_start_upsert s4_wakeup_s1 s4_wakeup_to_set_dead s4_wakeup_s2
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+injection_points_attach
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; <waiting ...>
+step s4_wakeup_to_swap: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s2_start_upsert: INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); <waiting ...>
+step s4_wakeup_s1: 
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s1_start_upsert: <... completed>
+step s4_wakeup_to_set_dead: 
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s4_wakeup_s2: 
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+
+injection_points_detach
+-----------------------
+                       
+(1 row)
+
+injection_points_wakeup
+-----------------------
+                       
+(1 row)
+
+step s3_start_reindex: <... completed>
+step s2_start_upsert: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 58f19001157..91fc8ce687f 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -44,7 +44,16 @@ tests += {
     'specs': [
       'basic',
       'inplace',
+      'reindex_concurrently_upsert',
+      'index_concurrently_upsert',
+      'reindex_concurrently_upsert_partitioned',
+      'reindex_concurrently_upsert_on_constraint',
+      'index_concurrently_upsert_predicate',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
+    # We wait for all snapshots, so avoid parallel test executions
+    'runningcheck-parallel': false,
   },
   'tap': {
     'env': {
@@ -53,5 +62,7 @@ tests += {
     'tests': [
       't/001_stats.pl',
     ],
+    # The injection points are cluster-wide, so disable installcheck
+    'runningcheck': false,
   },
 }
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
new file mode 100644
index 00000000000..075450935b6
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert.spec
@@ -0,0 +1,68 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_duplicate ON test.tbl(i); }
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
new file mode 100644
index 00000000000..70a27475e10
--- /dev/null
+++ b/src/test/modules/injection_points/specs/index_concurrently_upsert_predicate.spec
@@ -0,0 +1,70 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: CREATE UNIQUE INDEX CONCURRENTLY
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int, updated_at timestamp);
+
+	CREATE UNIQUE INDEX tbl_pkey_special ON test.tbl(abs(i)) WHERE i < 1000;
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+	SELECT injection_points_attach('invalidate_catalog_snapshot_end', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now())  on conflict(abs(i)) where i < 100 do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('define_index_before_set_valid', 'wait');
+}
+step s3_start_create_index		{ CREATE UNIQUE INDEX CONCURRENTLY tbl_pkey_special_duplicate ON test.tbl(abs(i)) WHERE i < 10000;}
+
+session s4
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s1_from_invalidate_catalog_snapshot	{
+	SELECT injection_points_detach('invalidate_catalog_snapshot_end');
+	SELECT injection_points_wakeup('invalidate_catalog_snapshot_end');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_define_index_before_set_valid	{
+	SELECT injection_points_detach('define_index_before_set_valid');
+	SELECT injection_points_wakeup('define_index_before_set_valid');
+}
+
+permutation
+	s3_start_create_index
+	s1_start_upsert
+	s4_wakeup_define_index_before_set_valid
+	s2_start_upsert
+	s4_wakeup_s1_from_invalidate_catalog_snapshot
+	s4_wakeup_s2
+	s4_wakeup_s1
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
new file mode 100644
index 00000000000..38b86d84345
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
new file mode 100644
index 00000000000..7d8e371bb0a
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_on_constraint.spec
@@ -0,0 +1,86 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE UNLOGGED TABLE test.tbl(i int primary key, updated_at timestamp);
+	ALTER TABLE test.tbl SET (parallel_workers=0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict on constraint tbl_pkey do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
diff --git a/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
new file mode 100644
index 00000000000..b9253463039
--- /dev/null
+++ b/src/test/modules/injection_points/specs/reindex_concurrently_upsert_partitioned.spec
@@ -0,0 +1,88 @@
+# Test race conditions involving:
+# - s1: UPSERT a tuple
+# - s2: UPSERT the same tuple
+# - s3: REINDEX CONCURRENTLY the partition's primary key index
+# - s4: operations with injection points
+
+setup
+{
+	CREATE EXTENSION injection_points;
+	CREATE SCHEMA test;
+	CREATE TABLE test.tbl(i int primary key, updated_at timestamp) PARTITION BY RANGE (i);
+	CREATE TABLE test.tbl_partition PARTITION OF test.tbl
+		FOR VALUES FROM (0) TO (10000)
+		WITH (parallel_workers = 0);
+}
+
+teardown
+{
+	DROP SCHEMA test CASCADE;
+	DROP EXTENSION injection_points;
+}
+
+session s1
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('check_exclusion_or_unique_constraint_no_conflict', 'wait');
+}
+step s1_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s2
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('exec_insert_before_insert_speculative', 'wait');
+}
+step s2_start_upsert	{ INSERT INTO test.tbl VALUES(13,now()) on conflict(i) do update set updated_at = now(); }
+
+session s3
+setup	{
+	SELECT injection_points_set_local();
+	SELECT injection_points_attach('reindex_relation_concurrently_before_set_dead', 'wait');
+	SELECT injection_points_attach('reindex_relation_concurrently_before_swap', 'wait');
+}
+step s3_start_reindex			{ REINDEX INDEX CONCURRENTLY test.tbl_partition_pkey; }
+
+session s4
+step s4_wakeup_to_swap		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_swap');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_swap');
+}
+step s4_wakeup_s1		{
+	SELECT injection_points_detach('check_exclusion_or_unique_constraint_no_conflict');
+	SELECT injection_points_wakeup('check_exclusion_or_unique_constraint_no_conflict');
+}
+step s4_wakeup_s2		{
+	SELECT injection_points_detach('exec_insert_before_insert_speculative');
+	SELECT injection_points_wakeup('exec_insert_before_insert_speculative');
+}
+step s4_wakeup_to_set_dead		{
+	SELECT injection_points_detach('reindex_relation_concurrently_before_set_dead');
+	SELECT injection_points_wakeup('reindex_relation_concurrently_before_set_dead');
+}
+
+permutation
+	s3_start_reindex
+	s1_start_upsert
+	s4_wakeup_to_swap
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s2_start_upsert
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_s2
+	s4_wakeup_to_set_dead
+
+permutation
+	s3_start_reindex
+	s4_wakeup_to_swap
+	s1_start_upsert
+	s2_start_upsert
+	s4_wakeup_s1
+	s4_wakeup_to_set_dead
+	s4_wakeup_s2
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] v9-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch (37.3K, 8-v9-0006-Add-STIR-Short-Term-Index-Replacement-access-meth.patch)
From 2976d46c4c65c844c1fe5c369c6b9942ccaf14cb Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Sat, 21 Dec 2024 18:36:10 +0100
Subject: [PATCH v9 6/9] Add STIR (Short-Term Index Replacement) access method

This patch provides foundational infrastructure for upcoming enhancements to
concurrent index builds by introducing:

- **ii_Auxiliary** in `IndexInfo`: Indicates that an index is an auxiliary
  index, specifically for use during concurrent index builds.
- **validate_index** in `IndexVacuumInfo`: Signals when a vacuum or cleanup
  operation is validating a newly built index (e.g., during concurrent build).

Additionally, a new **STIR (Short-Term Index Replacement)** access method is
introduced, intended solely for short-lived, auxiliary usage. STIR functions
as an ephemeral helper during concurrent index builds, temporarily storing TIDs
without providing the full features of a typical index. As such, it raises
warnings or errors when accessed outside its specialized usage path.

These changes lay essential groundwork for further improvements to concurrent
index builds.
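
As a rough illustration of the "temporarily storing TIDs" idea above, here is a
minimal standalone sketch (not the patch's code). All names in it (`DemoTid`,
`DemoPage`, `demo_page_add_item`, `DEMO_TIDS_PER_PAGE`) are hypothetical
stand-ins, and the fixed-size array only approximates a real buffer page: a
page holds nothing but heap pointers, appended at the end, and an add fails
once the page is full so that the caller can move on to another page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical capacity; a real page would derive this from its free space. */
#define DEMO_TIDS_PER_PAGE 4

/* Hypothetical stand-in for a heap tuple identifier (block, offset). */
typedef struct DemoTid
{
	uint32_t	block;
	uint16_t	offset;
} DemoTid;

/* Hypothetical page: a count of stored TIDs followed by the TIDs. */
typedef struct DemoPage
{
	uint16_t	maxoff;
	DemoTid		tids[DEMO_TIDS_PER_PAGE];
} DemoPage;

/*
 * Append one TID at the end of the page.
 * Returns false, leaving the page untouched, if the TID does not fit.
 */
static bool
demo_page_add_item(DemoPage *page, const DemoTid *tid)
{
	if (page->maxoff >= DEMO_TIDS_PER_PAGE)
		return false;

	memcpy(&page->tids[page->maxoff], tid, sizeof(DemoTid));
	page->maxoff++;
	return true;
}
```

Nothing but the TID is kept, so there are no keys to compare or order; that is
the sense in which such a structure cannot offer the features of a regular
index.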
---
 contrib/pgstattuple/pgstattuple.c        |   3 +
 src/backend/access/Makefile              |   2 +-
 src/backend/access/heap/vacuumlazy.c     |   2 +
 src/backend/access/meson.build           |   1 +
 src/backend/access/stir/Makefile         |  18 +
 src/backend/access/stir/meson.build      |   5 +
 src/backend/access/stir/stir.c           | 576 +++++++++++++++++++++++
 src/backend/catalog/index.c              |   1 +
 src/backend/commands/analyze.c           |   1 +
 src/backend/commands/vacuumparallel.c    |   1 +
 src/backend/nodes/makefuncs.c            |   1 +
 src/include/access/genam.h               |   1 +
 src/include/access/reloptions.h          |   3 +-
 src/include/access/stir.h                | 117 +++++
 src/include/catalog/pg_am.dat            |   3 +
 src/include/catalog/pg_opclass.dat       |   4 +
 src/include/catalog/pg_opfamily.dat      |   2 +
 src/include/catalog/pg_proc.dat          |   4 +
 src/include/nodes/execnodes.h            |   6 +-
 src/include/utils/index_selfuncs.h       |   8 +
 src/test/regress/expected/amutils.out    |   8 +-
 src/test/regress/expected/opr_sanity.out |   7 +-
 src/test/regress/expected/psql.out       |  24 +-
 23 files changed, 780 insertions(+), 18 deletions(-)
 create mode 100644 src/backend/access/stir/Makefile
 create mode 100644 src/backend/access/stir/meson.build
 create mode 100644 src/backend/access/stir/stir.c
 create mode 100644 src/include/access/stir.h

diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index ff7cc07df99..007efc4ed0c 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -282,6 +282,9 @@ pgstat_relation(Relation rel, FunctionCallInfo fcinfo)
 			case SPGIST_AM_OID:
 				err = "spgist index";
 				break;
+			case STIR_AM_OID:
+				err = "stir index";
+				break;
 			case BRIN_AM_OID:
 				err = "brin index";
 				break;
diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index 1932d11d154..cd6524a54ab 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  sequence table tablesample transam
+			  stir sequence table tablesample transam
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f2ca9430581..bec79b48cb2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2538,6 +2538,7 @@ lazy_vacuum_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
@@ -2589,6 +2590,7 @@ lazy_cleanup_one_index(Relation indrel, IndexBulkDeleteResult *istat,
 
 	ivinfo.num_heap_tuples = reltuples;
 	ivinfo.strategy = vacrel->bstrategy;
+	ivinfo.validate_index = false;
 
 	/*
 	 * Update error traceback information.
diff --git a/src/backend/access/meson.build b/src/backend/access/meson.build
index 62a371db7f7..63ee0ef134d 100644
--- a/src/backend/access/meson.build
+++ b/src/backend/access/meson.build
@@ -11,6 +11,7 @@ subdir('nbtree')
 subdir('rmgrdesc')
 subdir('sequence')
 subdir('spgist')
+subdir('stir')
 subdir('table')
 subdir('tablesample')
 subdir('transam')
diff --git a/src/backend/access/stir/Makefile b/src/backend/access/stir/Makefile
new file mode 100644
index 00000000000..fae5898b8d7
--- /dev/null
+++ b/src/backend/access/stir/Makefile
@@ -0,0 +1,18 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/stir
+#
+# IDENTIFICATION
+#    src/backend/access/stir/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/stir
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = \
+	stir.o
+
+include $(top_srcdir)/src/backend/common.mk
\ No newline at end of file
diff --git a/src/backend/access/stir/meson.build b/src/backend/access/stir/meson.build
new file mode 100644
index 00000000000..39c6eca848d
--- /dev/null
+++ b/src/backend/access/stir/meson.build
@@ -0,0 +1,5 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+backend_sources += files(
+	'stir.c',
+)
\ No newline at end of file
diff --git a/src/backend/access/stir/stir.c b/src/backend/access/stir/stir.c
new file mode 100644
index 00000000000..83aa255176f
--- /dev/null
+++ b/src/backend/access/stir/stir.c
@@ -0,0 +1,576 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.c
+ *	  Implementation of Short-Term Index Replacement.
+ *
+ * STIR is a specialized access method type designed for temporary storage
+ * of TID values during concurrent index build operations.
+ *
+ * The typical lifecycle of a STIR index is:
+ * 1. created as an auxiliary index for CIC/RIC
+ * 2. accepts inserts for a period
+ * 3. stirbulkdelete is called during the index validation phase
+ * 4. gets dropped
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/stir/stir.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/stir.h"
+#include "commands/vacuum.h"
+#include "utils/index_selfuncs.h"
+#include "catalog/pg_opclass.h"
+#include "catalog/pg_opfamily.h"
+#include "utils/catcache.h"
+#include "access/amvalidate.h"
+#include "utils/syscache.h"
+#include "access/htup_details.h"
+#include "catalog/pg_amproc.h"
+#include "catalog/index.h"
+#include "catalog/pg_amop.h"
+#include "utils/regproc.h"
+#include "storage/bufmgr.h"
+#include "access/tableam.h"
+#include "access/reloptions.h"
+#include "utils/memutils.h"
+#include "utils/fmgrprotos.h"
+
+/*
+ * Stir handler function: return IndexAmRoutine with access method parameters
+ * and callbacks.
+ */
+Datum
+stirhandler(PG_FUNCTION_ARGS)
+{
+	IndexAmRoutine *amroutine = makeNode(IndexAmRoutine);
+
+	/* Set STIR-specific strategy and procedure numbers */
+	amroutine->amstrategies = STIR_NSTRATEGIES;
+	amroutine->amsupport = STIR_NPROC;
+	amroutine->amoptsprocnum = STIR_OPTIONS_PROC;
+
+	/* STIR doesn't support most index operations */
+	amroutine->amcanorder = false;
+	amroutine->amcanorderbyop = false;
+	amroutine->amcanbackward = false;
+	amroutine->amcanunique = false;
+	amroutine->amcanmulticol = true;
+	amroutine->amoptionalkey = true;
+	amroutine->amsearcharray = false;
+	amroutine->amsearchnulls = false;
+	amroutine->amstorage = false;
+	amroutine->amclusterable = false;
+	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
+	amroutine->amcanbuildparallel = false;
+	amroutine->amcaninclude = true;
+	amroutine->amusemaintenanceworkmem = false;
+	amroutine->amparallelvacuumoptions =
+			VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_CLEANUP;
+	amroutine->amkeytype = InvalidOid;
+
+	/* Set up function callbacks */
+	amroutine->ambuild = stirbuild;
+	amroutine->ambuildempty = stirbuildempty;
+	amroutine->aminsert = stirinsert;
+	amroutine->aminsertcleanup = NULL;
+	amroutine->ambulkdelete = stirbulkdelete;
+	amroutine->amvacuumcleanup = stirvacuumcleanup;
+	amroutine->amcanreturn = NULL;
+	amroutine->amcostestimate = stircostestimate;
+	amroutine->amoptions = stiroptions;
+	amroutine->amproperty = NULL;
+	amroutine->ambuildphasename = NULL;
+	amroutine->amvalidate = stirvalidate;
+	amroutine->amadjustmembers = NULL;
+	amroutine->ambeginscan = stirbeginscan;
+	amroutine->amrescan = stirrescan;
+	amroutine->amgettuple = NULL;
+	amroutine->amgetbitmap = NULL;
+	amroutine->amendscan = stirendscan;
+	amroutine->ammarkpos = NULL;
+	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
+
+	PG_RETURN_POINTER(amroutine);
+}
+
+/*
+ * Validates operator class for STIR index.
+ *
+ * STIR is not a real index, so validation may be skipped.
+ * But we do it just for consistency.
+ */
+bool
+stirvalidate(Oid opclassoid)
+{
+	bool result = true;
+	HeapTuple classtup;
+	Form_pg_opclass classform;
+	Oid opfamilyoid;
+	HeapTuple familytup;
+	Form_pg_opfamily familyform;
+	char *opfamilyname;
+	CatCList *proclist,
+			*oprlist;
+	int i;
+
+	/* Fetch opclass information */
+	classtup = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclassoid));
+	if (!HeapTupleIsValid(classtup))
+		elog(ERROR, "cache lookup failed for operator class %u", opclassoid);
+	classform = (Form_pg_opclass) GETSTRUCT(classtup);
+
+	opfamilyoid = classform->opcfamily;
+
+
+	/* Fetch opfamily information */
+	familytup = SearchSysCache1(OPFAMILYOID, ObjectIdGetDatum(opfamilyoid));
+	if (!HeapTupleIsValid(familytup))
+		elog(ERROR, "cache lookup failed for operator family %u", opfamilyoid);
+	familyform = (Form_pg_opfamily) GETSTRUCT(familytup);
+
+	opfamilyname = NameStr(familyform->opfname);
+
+	/* Fetch all operators and support functions of the opfamily */
+	oprlist = SearchSysCacheList1(AMOPSTRATEGY, ObjectIdGetDatum(opfamilyoid));
+	proclist = SearchSysCacheList1(AMPROCNUM, ObjectIdGetDatum(opfamilyoid));
+
+	/* Check individual operators */
+	for (i = 0; i < oprlist->n_members; i++)
+	{
+		HeapTuple oprtup = &oprlist->members[i]->tuple;
+		Form_pg_amop oprform = (Form_pg_amop) GETSTRUCT(oprtup);
+
+		/* Check that it's an allowed strategy for stir */
+		if (oprform->amopstrategy < 1 ||
+			oprform->amopstrategy > STIR_NSTRATEGIES)
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with invalid strategy number %d",
+								   opfamilyname,
+								   format_operator(oprform->amopopr),
+								   oprform->amopstrategy)));
+			result = false;
+		}
+
+		/* stir doesn't support ORDER BY operators */
+		if (oprform->amoppurpose != AMOP_SEARCH ||
+			OidIsValid(oprform->amopsortfamily))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains invalid ORDER BY specification for operator %s",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+
+		/* Check operator signature --- same for all stir strategies */
+		if (!check_amop_signature(oprform->amopopr, BOOLOID,
+								  oprform->amoplefttype,
+								  oprform->amoprighttype))
+		{
+			ereport(INFO,
+					(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
+							errmsg("stir opfamily %s contains operator %s with wrong signature",
+								   opfamilyname,
+								   format_operator(oprform->amopopr))));
+			result = false;
+		}
+	}
+
+
+	ReleaseCatCacheList(proclist);
+	ReleaseCatCacheList(oprlist);
+	ReleaseSysCache(familytup);
+	ReleaseSysCache(classtup);
+
+	return result;
+}
+
+
+/*
+ * Initialize metapage of a STIR index.
+ * The skipInserts flag determines if new inserts will be accepted or skipped.
+ */
+void
+StirFillMetapage(Relation index, Page metaPage, bool skipInserts)
+{
+	StirMetaPageData *metadata;
+
+	StirInitPage(metaPage, STIR_META);
+	metadata = StirPageGetMeta(metaPage);
+	memset(metadata, 0, sizeof(StirMetaPageData));
+	metadata->magickNumber = STIR_MAGICK_NUMBER;
+	metadata->skipInserts = skipInserts;
+	((PageHeader) metaPage)->pd_lower += sizeof(StirMetaPageData);
+}
+
+/*
+ * Create and initialize the metapage for a STIR index.
+ * This is called during index creation.
+ */
+void
+StirInitMetapage(Relation index, ForkNumber forknum)
+{
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	/*
+	 * Make a new page; since it is the first page it should be associated with
+	 * block number 0 (STIR_METAPAGE_BLKNO).  No need to hold the extension
+	 * lock because there cannot be concurrent inserters yet.
+	 */
+	metaBuffer = ReadBufferExtended(index, forknum, P_NEW, RBM_NORMAL, NULL);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+	Assert(BufferGetBlockNumber(metaBuffer) == STIR_METAPAGE_BLKNO);
+
+	/* Initialize contents of meta page */
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	StirFillMetapage(index, metaPage, forknum == INIT_FORKNUM);
+	GenericXLogFinish(state);
+
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+/*
+ * Initialize any page of a stir index.
+ */
+void
+StirInitPage(Page page, uint16 flags)
+{
+	StirPageOpaque opaque;
+
+	PageInit(page, BLCKSZ, sizeof(StirPageOpaqueData));
+
+	opaque = StirPageGetOpaque(page);
+	opaque->flags = flags;
+	opaque->stir_page_id = STIR_PAGE_ID;
+}
+
+/*
+ * Add a tuple to a STIR page. Returns false if tuple doesn't fit.
+ * The tuple is added to the end of the page.
+ */
+static bool
+StirPageAddItem(Page page, StirTuple *tuple)
+{
+	StirTuple *itup;
+	StirPageOpaque opaque;
+	Pointer ptr;
+
+	/* We shouldn't be pointed to an invalid page */
+	Assert(!PageIsNew(page));
+
+	/* Does new tuple fit on the page? */
+	if (StirPageGetFreeSpace(state, page) < sizeof(StirTuple))
+		return false;
+
+	/* Copy new tuple to the end of page */
+	opaque = StirPageGetOpaque(page);
+	itup = StirPageGetTuple(page, opaque->maxoff + 1);
+	memcpy((Pointer) itup, (Pointer) tuple, sizeof(StirTuple));
+
+	/* Adjust maxoff and pd_lower */
+	opaque->maxoff++;
+	ptr = (Pointer) StirPageGetTuple(page, opaque->maxoff + 1);
+	((PageHeader) page)->pd_lower = ptr - page;
+
+	/* Assert we didn't overrun available space */
+	Assert(((PageHeader) page)->pd_lower <= ((PageHeader) page)->pd_upper);
+	return true;
+}
+
+/*
+ * Insert a new tuple into a STIR index.
+ */
+bool
+stirinsert(Relation index, Datum *values, bool *isnull,
+		  ItemPointer ht_ctid, Relation heapRel,
+		  IndexUniqueCheck checkUnique,
+		  bool indexUnchanged,
+		  struct IndexInfo *indexInfo)
+{
+	StirTuple *itup;
+	MemoryContext oldCtx;
+	MemoryContext insertCtx;
+	StirMetaPageData *metaData;
+	Buffer buffer,
+			metaBuffer;
+	Page page;
+	GenericXLogState *state;
+	uint16 blkNo;
+
+	/* Create temporary context for insert operation */
+	insertCtx = AllocSetContextCreate(CurrentMemoryContext,
+									  "Stir insert temporary context",
+									  ALLOCSET_DEFAULT_SIZES);
+
+	oldCtx = MemoryContextSwitchTo(insertCtx);
+
+	/* Create new tuple with heap pointer */
+	itup = (StirTuple *) palloc0(sizeof(StirTuple));
+	itup->heapPtr = *ht_ctid;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+
+	for (;;)
+	{
+		LockBuffer(metaBuffer, BUFFER_LOCK_SHARE);
+		metaData = StirPageGetMeta(BufferGetPage(metaBuffer));
+		/* Check if inserts are allowed */
+		if (metaData->skipInserts)
+		{
+			UnlockReleaseBuffer(metaBuffer);
+			return false;
+		}
+		blkNo = metaData->lastBlkNo;
+		/* Don't hold metabuffer lock while doing insert */
+		LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+
+		if (blkNo > 0)
+		{
+			buffer = ReadBuffer(index, blkNo);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			state = GenericXLogStart(index);
+			page = GenericXLogRegisterBuffer(state, buffer, 0);
+
+			Assert(!PageIsNew(page));
+
+			/* Try to add tuple to existing page */
+			if (StirPageAddItem(page, itup))
+			{
+				/* Success!  Apply the change, clean up, and exit */
+				GenericXLogFinish(state);
+				UnlockReleaseBuffer(buffer);
+				ReleaseBuffer(metaBuffer);
+				MemoryContextSwitchTo(oldCtx);
+				MemoryContextDelete(insertCtx);
+				return false;
+			}
+
+			/* Didn't fit, must try other pages */
+			GenericXLogAbort(state);
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Need to add new page - get exclusive lock on meta page */
+		LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+		state = GenericXLogStart(index);
+		metaData = StirPageGetMeta(GenericXLogRegisterBuffer(state, metaBuffer, GENERIC_XLOG_FULL_IMAGE));
+		/* Check if another backend already extended the index */
+
+		if (blkNo != metaData->lastBlkNo)
+		{
+			Assert(blkNo < metaData->lastBlkNo);
+			/* Someone else inserted a new page into the index; let's try
+			 * again. */
+			GenericXLogAbort(state);
+			LockBuffer(metaBuffer, BUFFER_LOCK_UNLOCK);
+			continue;
+		}
+		else
+		{
+			/* Must extend the file */
+			buffer = ExtendBufferedRel(BMR_REL(index), MAIN_FORKNUM, NULL,
+									   EB_LOCK_FIRST);
+
+			page = GenericXLogRegisterBuffer(state, buffer, GENERIC_XLOG_FULL_IMAGE);
+			StirInitPage(page, 0);
+
+			if (!StirPageAddItem(page, itup))
+			{
+				/* We shouldn't be here since we're inserting to an empty page */
+				elog(ERROR, "could not add new stir tuple to empty page");
+			}
+
+			/* Update meta page with new last block number */
+			metaData->lastBlkNo = BufferGetBlockNumber(buffer);
+			GenericXLogFinish(state);
+
+			UnlockReleaseBuffer(buffer);
+			UnlockReleaseBuffer(metaBuffer);
+
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextDelete(insertCtx);
+
+			return false;
+		}
+	}
+}
+
+/*
+ * STIR doesn't support scans - these functions all error out
+ */
+IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void
+stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+		  ScanKey orderbys, int norderbys)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+void stirendscan(IndexScanDesc scan)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
+
+/*
+ * Build a STIR index - only allowed for auxiliary indexes.
+ * Just initializes the meta page without any heap scans.
+ */
+IndexBuildResult *stirbuild(Relation heap, Relation index,
+						   struct IndexInfo *indexInfo)
+{
+	IndexBuildResult *result;
+
+	if (!indexInfo->ii_Auxiliary)
+		ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("STIR indexes are not supported to be built")));
+
+	StirInitMetapage(index, MAIN_FORKNUM);
+
+	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
+	result->heap_tuples = 0;
+	result->index_tuples = 0;
+	return result;
+}
+
+void stirbuildempty(Relation index)
+{
+	StirInitMetapage(index, INIT_FORKNUM);
+}
+
+IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+									 IndexBulkDeleteResult *stats,
+									 IndexBulkDeleteCallback callback,
+									 void *callback_state)
+{
+	Relation index = info->index;
+	BlockNumber blkno, npages;
+	Buffer buffer;
+	Page page;
+
+	/* For normal VACUUM, mark the index to skip inserts and warn that it should be dropped */
+	if (!info->validate_index)
+	{
+		StirMarkAsSkipInserts(index);
+
+		ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				errmsg("\"%s\" is not implemented; this index likely needs to be dropped", __func__)));
+		return NULL;
+	}
+
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	/*
+	 * Iterate over the pages. We don't care about concurrently added pages,
+	 * because the index is marked as not-ready at this moment and is not
+	 * used for inserts.
+	 */
+	npages = RelationGetNumberOfBlocks(index);
+	for (blkno = STIR_HEAD_BLKNO; blkno < npages; blkno++)
+	{
+		StirTuple *itup, *itupEnd;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
+									RBM_NORMAL, info->strategy);
+
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		if (PageIsNew(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		itup = StirPageGetTuple(page, FirstOffsetNumber);
+		itupEnd = StirPageGetTuple(page, OffsetNumberNext(StirPageGetMaxOffset(page)));
+		while (itup < itupEnd)
+		{
+			/* Do we have to delete this tuple? */
+			if (callback(&itup->heapPtr, callback_state))
+			{
+				ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("we never delete in stir")));
+			}
+
+			itup = StirPageGetNextTuple(itup);
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	return stats;
+}
+
+/*
+ * Mark a STIR index to skip future inserts
+ */
+void StirMarkAsSkipInserts(Relation index)
+{
+	StirMetaPageData *metaData;
+	Buffer metaBuffer;
+	Page metaPage;
+	GenericXLogState *state;
+
+	metaBuffer = ReadBuffer(index, STIR_METAPAGE_BLKNO);
+	LockBuffer(metaBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+	state = GenericXLogStart(index);
+	metaPage = GenericXLogRegisterBuffer(state, metaBuffer,
+										 GENERIC_XLOG_FULL_IMAGE);
+	metaData = StirPageGetMeta(metaPage);
+	if (!metaData->skipInserts)
+	{
+		metaData->skipInserts = true;
+		GenericXLogFinish(state);
+	}
+	else
+	{
+		GenericXLogAbort(state);
+	}
+	UnlockReleaseBuffer(metaBuffer);
+}
+
+IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+										IndexBulkDeleteResult *stats)
+{
+	StirMarkAsSkipInserts(info->index);
+	ereport(WARNING, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			errmsg("\"%s\" is not implemented; this index likely needs to be dropped", __func__)));
+	return NULL;
+}
+
+bytea *stiroptions(Datum reloptions, bool validate)
+{
+	return NULL;
+}
+
+void stircostestimate(PlannerInfo *root, IndexPath *path,
+					 double loop_count, Cost *indexStartupCost,
+					 Cost *indexTotalCost, Selectivity *indexSelectivity,
+					 double *indexCorrelation, double *indexPages)
+{
+	ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("\"%s\" is not implemented", __func__)));
+}
\ No newline at end of file
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 73454accf61..7ff7ab6c72a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3403,6 +3403,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.message_level = DEBUG2;
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
+	ivinfo.validate_index = true;
 
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 9a56de2282f..d54d310ba43 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -718,6 +718,7 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 			ivinfo.message_level = elevel;
 			ivinfo.num_heap_tuples = onerel->rd_rel->reltuples;
 			ivinfo.strategy = vac_strategy;
+			ivinfo.validate_index = false;
 
 			stats = index_vacuum_cleanup(&ivinfo, NULL);
 
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 67cba17a564..e4327b4f7dc 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -884,6 +884,7 @@ parallel_vacuum_process_one_index(ParallelVacuumState *pvs, Relation indrel,
 	ivinfo.estimated_count = pvs->shared->estimated_count;
 	ivinfo.num_heap_tuples = pvs->shared->reltuples;
 	ivinfo.strategy = pvs->bstrategy;
+	ivinfo.validate_index = false;
 
 	/* Update error traceback information */
 	pvs->indname = pstrdup(RelationGetRelationName(indrel));
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7e5df7bea4d..44a8a1f2875 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -825,6 +825,7 @@ makeIndexInfo(int numattrs, int numkeyattrs, Oid amoid, List *expressions,
 	/* initialize index-build state to default */
 	n->ii_BrokenHotChain = false;
 	n->ii_ParallelWorkers = 0;
+	n->ii_Auxiliary = false;
 
 	/* set up for possible use by index AM */
 	n->ii_Am = amoid;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81653febc18..194dbbe1d0e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -52,6 +52,7 @@ typedef struct IndexVacuumInfo
 	bool		estimated_count;	/* num_heap_tuples is an estimate */
 	int			message_level;	/* ereport level for progress messages */
 	double		num_heap_tuples;	/* tuples remaining in heap */
+	bool		validate_index; /* validating concurrently built index? */
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
diff --git a/src/include/access/reloptions.h b/src/include/access/reloptions.h
index df6923c9d50..0966397d344 100644
--- a/src/include/access/reloptions.h
+++ b/src/include/access/reloptions.h
@@ -51,8 +51,9 @@ typedef enum relopt_kind
 	RELOPT_KIND_VIEW = (1 << 9),
 	RELOPT_KIND_BRIN = (1 << 10),
 	RELOPT_KIND_PARTITIONED = (1 << 11),
+	RELOPT_KIND_STIR = (1 << 12),
 	/* if you add a new kind, make sure you update "last_default" too */
-	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_PARTITIONED,
+	RELOPT_KIND_LAST_DEFAULT = RELOPT_KIND_STIR,
 	/* some compilers treat enums as signed ints, so we can't use 1 << 31 */
 	RELOPT_KIND_MAX = (1 << 30)
 } relopt_kind;
diff --git a/src/include/access/stir.h b/src/include/access/stir.h
new file mode 100644
index 00000000000..9943c42a97e
--- /dev/null
+++ b/src/include/access/stir.h
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * stir.h
+ *	  header file for postgres stir access method implementation.
+ *
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * src/include/access/stir.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef _STIR_H_
+#define _STIR_H_
+
+#include "amapi.h"
+#include "xlog.h"
+#include "generic_xlog.h"
+#include "itup.h"
+#include "fmgr.h"
+#include "nodes/pathnodes.h"
+
+/* Support procedures numbers */
+#define STIR_NPROC				0
+
+/* Scan strategies */
+#define STIR_NSTRATEGIES		1
+
+#define STIR_OPTIONS_PROC				0
+
+/* Macros for accessing stir page structures */
+#define StirPageGetOpaque(page) ((StirPageOpaque) PageGetSpecialPointer(page))
+#define StirPageGetMaxOffset(page) (StirPageGetOpaque(page)->maxoff)
+#define StirPageIsMeta(page) \
+	((StirPageGetOpaque(page)->flags & STIR_META) != 0)
+#define StirPageGetData(page)		((StirTuple *)PageGetContents(page))
+#define StirPageGetTuple(page, offset) \
+	((StirTuple *)(PageGetContents(page) \
+		+ sizeof(StirTuple) * ((offset) - 1)))
+#define StirPageGetNextTuple(tuple) \
+	((StirTuple *)((Pointer)(tuple) + sizeof(StirTuple)))
+
+
+
+/* Preserved page numbers */
+#define STIR_METAPAGE_BLKNO	(0)
+#define STIR_HEAD_BLKNO		(1) /* first data page */
+
+
+/* Opaque for stir pages */
+typedef struct StirPageOpaqueData
+{
+	OffsetNumber maxoff;		/* number of index tuples on page */
+	uint16		flags;			/* see bit definitions below */
+	uint16		unused;			/* placeholder to force maxaligning of size of
+								 * StirPageOpaqueData and to place
+								 * stir_page_id exactly at the end of page */
+	uint16		stir_page_id;	/* for identification of STIR indexes */
+} StirPageOpaqueData;
+
+/* Stir page flags */
+#define STIR_META		(1<<0)
+
+typedef StirPageOpaqueData *StirPageOpaque;
+
+#define STIR_PAGE_ID		0xFF84
+
+/* Metadata of stir index */
+typedef struct StirMetaPageData
+{
+	uint32		magickNumber;
+	uint16		lastBlkNo;
+	bool		skipInserts;	/* should we just exit without any inserts */
+} StirMetaPageData;
+
+/* Magic number to distinguish stir pages from others */
+#define STIR_MAGICK_NUMBER (0xDBAC0DEF)
+
+#define StirPageGetMeta(page)	((StirMetaPageData *) PageGetContents(page))
+
+typedef struct StirTuple
+{
+	ItemPointerData heapPtr;
+} StirTuple;
+
+#define StirPageGetFreeSpace(state, page) \
+	(BLCKSZ - MAXALIGN(SizeOfPageHeaderData) \
+		- StirPageGetMaxOffset(page) * (sizeof(StirTuple)) \
+		- MAXALIGN(sizeof(StirPageOpaqueData)))
+
+extern void StirFillMetapage(Relation index, Page metaPage, bool skipInserts);
+extern void StirInitMetapage(Relation index, ForkNumber forknum);
+extern void StirInitPage(Page page, uint16 flags);
+extern void StirMarkAsSkipInserts(Relation index);
+
+/* index access method interface functions */
+extern bool stirvalidate(Oid opclassoid);
+extern bool stirinsert(Relation index, Datum *values, bool *isnull,
+					 ItemPointer ht_ctid, Relation heapRel,
+					 IndexUniqueCheck checkUnique,
+					 bool indexUnchanged,
+					 struct IndexInfo *indexInfo);
+extern IndexScanDesc stirbeginscan(Relation r, int nkeys, int norderbys);
+extern void stirrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
+					 ScanKey orderbys, int norderbys);
+extern void stirendscan(IndexScanDesc scan);
+extern IndexBuildResult *stirbuild(Relation heap, Relation index,
+								 struct IndexInfo *indexInfo);
+extern void stirbuildempty(Relation index);
+extern IndexBulkDeleteResult *stirbulkdelete(IndexVacuumInfo *info,
+										   IndexBulkDeleteResult *stats, IndexBulkDeleteCallback callback,
+										   void *callback_state);
+extern IndexBulkDeleteResult *stirvacuumcleanup(IndexVacuumInfo *info,
+											  IndexBulkDeleteResult *stats);
+extern bytea *stiroptions(Datum reloptions, bool validate);
+
+#endif
\ No newline at end of file
diff --git a/src/include/catalog/pg_am.dat b/src/include/catalog/pg_am.dat
index db874902820..51350df0bf0 100644
--- a/src/include/catalog/pg_am.dat
+++ b/src/include/catalog/pg_am.dat
@@ -33,5 +33,8 @@
 { oid => '3580', oid_symbol => 'BRIN_AM_OID',
   descr => 'block range index (BRIN) access method',
   amname => 'brin', amhandler => 'brinhandler', amtype => 'i' },
+{ oid => '5555', oid_symbol => 'STIR_AM_OID',
+  descr => 'short term index replacement access method',
+  amname => 'stir', amhandler => 'stirhandler', amtype => 'i' },
 
 ]
diff --git a/src/include/catalog/pg_opclass.dat b/src/include/catalog/pg_opclass.dat
index f503c652ebc..a8f0e66d15b 100644
--- a/src/include/catalog/pg_opclass.dat
+++ b/src/include/catalog/pg_opclass.dat
@@ -488,4 +488,8 @@
 
 # no brin opclass for the geometric types except box
 
+# allow any types for STIR
+{ opcmethod => 'stir', oid_symbol => 'ANY_STIR_OPS_OID', opcname => 'stir_ops',
+  opcfamily => 'stir/any_ops', opcintype => 'any'},
+
 ]
diff --git a/src/include/catalog/pg_opfamily.dat b/src/include/catalog/pg_opfamily.dat
index c8ac8c73def..41ea0c3ca50 100644
--- a/src/include/catalog/pg_opfamily.dat
+++ b/src/include/catalog/pg_opfamily.dat
@@ -304,5 +304,7 @@
   opfmethod => 'hash', opfname => 'multirange_ops' },
 { oid => '6158',
   opfmethod => 'gist', opfname => 'multirange_ops' },
+{ oid => '5558',
+  opfmethod => 'stir', opfname => 'any_ops' },
 
 ]
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2dcc2d42dac..34564109e50 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -935,6 +935,10 @@
   proname => 'brinhandler', provolatile => 'v',
   prorettype => 'index_am_handler', proargtypes => 'internal',
   prosrc => 'brinhandler' },
+{ oid => '5556', descr => 'short term index replacement access method handler',
+  proname => 'stirhandler', provolatile => 'v',
+  prorettype => 'index_am_handler', proargtypes => 'internal',
+  prosrc => 'stirhandler' },
 { oid => '3952', descr => 'brin: standalone scan new table pages',
   proname => 'brin_summarize_new_values', provolatile => 'v',
   proparallel => 'u', prorettype => 'int4', proargtypes => 'regclass',
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1590b643920..7d4e43148e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -172,12 +172,13 @@ typedef struct ExprState
  *		BrokenHotChain		did we detect any broken HOT chains?
  *		Summarizing			is it a summarizing index?
  *		ParallelWorkers		# of workers requested (excludes leader)
+ *		Auxiliary			is it an auxiliary index for a concurrent build?
  *		Am					Oid of index AM
  *		AmCache				private cache area for index AM
  *		Context				memory context holding this IndexInfo
  *
- * ii_Concurrent, ii_BrokenHotChain, and ii_ParallelWorkers are used only
- * during index build; they're conventionally zeroed otherwise.
+ * ii_Concurrent, ii_BrokenHotChain, ii_Auxiliary and ii_ParallelWorkers
+ * are used only during index build; they're conventionally zeroed otherwise.
  * ----------------
  */
 typedef struct IndexInfo
@@ -206,6 +207,7 @@ typedef struct IndexInfo
 	bool		ii_Summarizing;
 	bool		ii_WithoutOverlaps;
 	int			ii_ParallelWorkers;
+	bool		ii_Auxiliary;
 	Oid			ii_Am;
 	void	   *ii_AmCache;
 	MemoryContext ii_Context;
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index a41cd2b7fd9..61f3d3dea0c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -62,6 +62,14 @@ extern void spgcostestimate(struct PlannerInfo *root,
 							Selectivity *indexSelectivity,
 							double *indexCorrelation,
 							double *indexPages);
+extern void stircostestimate(struct PlannerInfo *root,
+							struct IndexPath *path,
+							double loop_count,
+							Cost *indexStartupCost,
+							Cost *indexTotalCost,
+							Selectivity *indexSelectivity,
+							double *indexCorrelation,
+							double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 							struct IndexPath *path,
 							double loop_count,
diff --git a/src/test/regress/expected/amutils.out b/src/test/regress/expected/amutils.out
index 7ab6113c619..92c033a2010 100644
--- a/src/test/regress/expected/amutils.out
+++ b/src/test/regress/expected/amutils.out
@@ -173,7 +173,13 @@ select amname, prop, pg_indexam_has_property(a.oid, prop) as p
  spgist | can_exclude   | t
  spgist | can_include   | t
  spgist | bogus         | 
-(36 rows)
+ stir   | can_order     | f
+ stir   | can_unique    | f
+ stir   | can_multi_col | t
+ stir   | can_exclude   | f
+ stir   | can_include   | t
+ stir   | bogus         | 
+(42 rows)
 
 --
 -- additional checks for pg_index_column_has_property
diff --git a/src/test/regress/expected/opr_sanity.out b/src/test/regress/expected/opr_sanity.out
index b673642ad1d..2645d970629 100644
--- a/src/test/regress/expected/opr_sanity.out
+++ b/src/test/regress/expected/opr_sanity.out
@@ -2119,9 +2119,10 @@ FROM pg_opclass AS c1
 WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS a1
                  WHERE a1.amopfamily = c1.opcfamily
                    AND binary_coercible(c1.opcintype, a1.amoplefttype));
- opcname | opcfamily 
----------+-----------
-(0 rows)
+ opcname  | opcfamily 
+----------+-----------
+ stir_ops |      5558
+(1 row)
 
 -- Check that each operator listed in pg_amop has an associated opclass,
 -- that is one whose opcintype matches oprleft (possibly by coercion).
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 36dc31c16c4..a6d86cb4ca0 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -5074,7 +5074,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA *
 List of access methods
@@ -5088,7 +5089,8 @@ List of access methods
  heap   | Table
  heap2  | Table
  spgist | Index
-(8 rows)
+ stir   | Index
+(9 rows)
 
 \dA h*
 List of access methods
@@ -5113,9 +5115,9 @@ List of access methods
 
 \dA: extra argument "bar" ignored
 \dA+
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5124,12 +5126,13 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ *
-                             List of access methods
-  Name  | Type  |       Handler        |              Description               
---------+-------+----------------------+----------------------------------------
+                               List of access methods
+  Name  | Type  |       Handler        |                Description                 
+--------+-------+----------------------+--------------------------------------------
  brin   | Index | brinhandler          | block range index (BRIN) access method
  btree  | Index | bthandler            | b-tree index access method
  gin    | Index | ginhandler           | GIN index access method
@@ -5138,7 +5141,8 @@ List of access methods
  heap   | Table | heap_tableam_handler | heap table access method
  heap2  | Table | heap_tableam_handler | 
  spgist | Index | spghandler           | SP-GiST index access method
-(8 rows)
+ stir   | Index | stirhandler          | short term index replacement access method
+(9 rows)
 
 \dA+ h*
                      List of access methods
-- 
2.43.0


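The append-only page layout used by StirPageAddItem in the patch above can
be illustrated outside PostgreSQL. The following is only a minimal sketch:
every Sketch* name is hypothetical, and the sizes (8192-byte block, 24-byte
page header, 8-byte opaque area, 6-byte TID) are assumptions standing in for
BLCKSZ, the page header, StirPageOpaqueData, and ItemPointerData. It shows
the same idea: fixed-size TIDs are packed one after another, and an insert
is refused once less than one TID's worth of space remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed sizes; the real patch derives these from PostgreSQL headers. */
#define SKETCH_PAGE_SIZE	8192	/* stand-in for BLCKSZ */
#define SKETCH_HEADER_SIZE	24		/* stand-in for the page header */
#define SKETCH_OPAQUE_SIZE	8		/* stand-in for the opaque area */

/* Hypothetical 6-byte heap pointer, mimicking ItemPointerData. */
typedef struct SketchTid
{
	uint16_t	blk_hi;
	uint16_t	blk_lo;
	uint16_t	offset;
} SketchTid;

_Static_assert(sizeof(SketchTid) == 6, "sketch assumes a 6-byte TID");

/* Hypothetical page: a counter plus a flat array of packed TIDs. */
typedef struct SketchPage
{
	uint16_t	maxoff;			/* number of TIDs stored so far */
	unsigned char data[SKETCH_PAGE_SIZE - SKETCH_HEADER_SIZE - SKETCH_OPAQUE_SIZE];
} SketchPage;

/* Bytes still available for TIDs on this page. */
static size_t
sketch_free_space(const SketchPage *page)
{
	return sizeof(page->data) - (size_t) page->maxoff * sizeof(SketchTid);
}

/* Append one TID at the end of the page; return false if it does not fit. */
static bool
sketch_page_add(SketchPage *page, const SketchTid *tid)
{
	if (sketch_free_space(page) < sizeof(SketchTid))
		return false;

	memcpy(page->data + (size_t) page->maxoff * sizeof(SketchTid),
		   tid, sizeof(SketchTid));
	page->maxoff++;
	return true;
}
```

A caller would zero a SketchPage, call sketch_page_add() until it returns
false, and then move on to a fresh page, which is roughly what stirinsert
does when it extends the relation. Under the sizes assumed here a page holds
(8192 - 24 - 8) / 6 TIDs.
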

  [application/octet-stream] v9-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch (76.2K, 9-v9-0007-Improve-CREATE-REINDEX-INDEX-CONCURRENTLY-using-a.patch)
From 6e38968bc529c4c72d3473d19405f5e3b79d1ff2 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 13:40:45 +0100
Subject: [PATCH v9 7/9] Improve CREATE/REINDEX INDEX CONCURRENTLY using
 auxiliary index

Modify the concurrent index build process to use an auxiliary unlogged index
during construction. This makes concurrent index operations more efficient by:

- Creating an auxiliary STIR (Short Term Index Replacement) index to track
  new tuples during the main index build
- Using the auxiliary index to catch all tuples inserted during the build phase
  instead of relying on a second heap scan
- Merging the auxiliary index content with the main index during validation
- Automatically cleaning up the auxiliary index after the main index is ready

This approach eliminates the need for a second full table scan during index
validation, making the process more efficient especially for large tables.
The auxiliary index is automatically dropped after the main index becomes valid.

This change affects both CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY
operations. The STIR access method is added specifically for these auxiliary
indexes and cannot be used directly by users.
---
 src/backend/access/heap/heapam_handler.c      | 383 +++++++++---------
 src/backend/catalog/index.c                   | 280 +++++++++++--
 src/backend/catalog/toasting.c                |   3 +-
 src/backend/commands/indexcmds.c              | 362 +++++++++++++----
 src/include/access/tableam.h                  |  28 +-
 src/include/catalog/index.h                   |  15 +-
 src/include/commands/progress.h               |   4 +-
 .../expected/cic_reset_snapshots.out          |  28 ++
 .../sql/cic_reset_snapshots.sql               |   1 +
 src/test/regress/expected/create_index.out    |   4 +
 src/test/regress/expected/indexing.out        |   3 +-
 src/test/regress/sql/create_index.sql         |   3 +
 12 files changed, 791 insertions(+), 323 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0f706553605..ecec3c1c080 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -41,6 +41,7 @@
 #include "storage/bufpage.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/proc.h"
 #include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
@@ -1777,246 +1778,266 @@ heapam_index_build_range_scan(Relation heapRelation,
 	return reltuples;
 }
 
-static void
+static TransactionId
 heapam_index_validate_scan(Relation heapRelation,
 						   Relation indexRelation,
 						   IndexInfo *indexInfo,
-						   Snapshot snapshot,
-						   ValidateIndexState *state)
+						   ValidateIndexState *state,
+						   ValidateIndexState *auxState)
 {
-	TableScanDesc scan;
-	HeapScanDesc hscan;
-	HeapTuple	heapTuple;
+	IndexFetchTableData *fetch;
+	TransactionId limitXmin;
+
 	Datum		values[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
-	ExprState  *predicate;
-	TupleTableSlot *slot;
-	EState	   *estate;
-	ExprContext *econtext;
-	BlockNumber root_blkno = InvalidBlockNumber;
-	OffsetNumber root_offsets[MaxHeapTuplesPerPage];
-	bool		in_index[MaxHeapTuplesPerPage];
-	BlockNumber previous_blkno = InvalidBlockNumber;
+
+	Snapshot		snapshot;
+	TupleTableSlot  *slot;
+	EState			*estate;
+	ExprContext		*econtext;
 
 	/* state variables for the merge */
-	ItemPointer indexcursor = NULL;
-	ItemPointerData decoded;
-	bool		tuplesort_empty = false;
+	ItemPointer 	indexcursor = NULL,
+					auxindexcursor = NULL,
+					prev_indexcursor = NULL;
+	ItemPointerData decoded,
+					auxdecoded,
+					prev_decoded,
+					fetched;
+	bool			tuplesort_empty = false,
+					auxtuplesort_empty = false;
+
+	Assert(!HaveRegisteredOrActiveSnapshot());
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+
+	/*
+	 * Now take the "reference snapshot" that will be used to filter candidate
+	 * tuples.  Beware!  There might still be snapshots in
+	 * use that treat some transaction as in-progress that our reference
+	 * snapshot treats as committed.  If such a recently-committed transaction
+	 * deleted tuples in the table, we will not include them in the index; yet
+	 * those transactions which see the deleting one as still-in-progress will
+	 * expect such tuples to be there once we mark the index as valid.
+	 *
+	 * We solve this by waiting for all endangered transactions to exit before
+	 * we mark the index as valid.
+	 *
+	 * We also set ActiveSnapshot to this snap, since functions in indexes may
+	 * need a snapshot.
+	 */
+	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	PushActiveSnapshot(snapshot);
+	limitXmin = snapshot->xmin;
 
 	/*
 	 * sanity checks
 	 */
 	Assert(OidIsValid(indexRelation->rd_rel->relam));
 
-	/*
-	 * Need an EState for evaluation of index expressions and partial-index
-	 * predicates.  Also a slot to hold the current tuple.
-	 */
 	estate = CreateExecutorState();
 	econtext = GetPerTupleExprContext(estate);
 	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation),
-									&TTSOpsHeapTuple);
+									&TTSOpsBufferHeapTuple);
 
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
-	/* Set up execution state for predicate, if any. */
-	predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
-
 	/*
-	 * Prepare for scan of the base relation.  We need just those tuples
-	 * satisfying the passed-in reference snapshot.  We must disable syncscan
-	 * here, because it's critical that we read from block zero forward to
-	 * match the sorted TIDs.
+	 * Prepare to fetch heap tuples in index style. This helps to reconstruct
+	 * a tuple from the heap when we only have an ItemPointer.
 	 */
-	scan = table_beginscan_strat(heapRelation,	/* relation */
-								 snapshot,	/* snapshot */
-								 0, /* number of keys */
-								 NULL,	/* scan key */
-								 true,	/* buffer access strategy OK */
-								 false,	/* syncscan not OK */
-								 false);
-	hscan = (HeapScanDesc) scan;
+	fetch = heapam_index_fetch_begin(heapRelation);
+
+	/* Initialize pointers. */
+	ItemPointerSetInvalid(&decoded);
+	ItemPointerSetInvalid(&prev_decoded);
+	ItemPointerSetInvalid(&auxdecoded);
+	ItemPointerSetInvalid(&fetched);
 
-	pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_TOTAL,
-								 hscan->rs_nblocks);
+	/* We'll track the last "main" index position in prev_indexcursor. */
+	prev_indexcursor = &prev_decoded;
 
 	/*
-	 * Scan all tuples matching the snapshot.
+	 * Main loop: we step through the auxiliary sort (auxState->tuplesort),
+	 * which holds TIDs that must be merged with or compared to those from
+	 * the "main" sort (state->tuplesort).
 	 */
-	while ((heapTuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
+	while (!auxtuplesort_empty)
 	{
-		ItemPointer heapcursor = &heapTuple->t_self;
-		ItemPointerData rootTuple;
-		OffsetNumber root_offnum;
-
+		Datum		ts_val;
+		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
-		state->htups += 1;
-
-		if ((previous_blkno == InvalidBlockNumber) ||
-			(hscan->rs_cblock != previous_blkno))
-		{
-			pgstat_progress_update_param(PROGRESS_SCAN_BLOCKS_DONE,
-										 hscan->rs_cblock);
-			previous_blkno = hscan->rs_cblock;
-		}
-
 		/*
-		 * As commented in table_index_build_scan, we should index heap-only
-		 * tuples under the TIDs of their root tuples; so when we advance onto
-		 * a new heap page, build a map of root item offsets on the page.
-		 *
-		 * This complicates merging against the tuplesort output: we will
-		 * visit the live tuples in order by their offsets, but the root
-		 * offsets that we need to compare against the index contents might be
-		 * ordered differently.  So we might have to "look back" within the
-		 * tuplesort output, but only within the current page.  We handle that
-		 * by keeping a bool array in_index[] showing all the
-		 * already-passed-over tuplesort output TIDs of the current page. We
-		 * clear that array here, when advancing onto a new heap page.
-		 */
-		if (hscan->rs_cblock != root_blkno)
+		 * Attempt to fetch the next TID from the auxiliary sort. If it's
+		 * empty, we set auxindexcursor to NULL.
+		 */
+		auxtuplesort_empty = !tuplesort_getdatum(auxState->tuplesort, true,
+												 false, &ts_val, &ts_isnull,
+												 NULL);
+		Assert(auxtuplesort_empty || !ts_isnull);
+		if (!auxtuplesort_empty)
 		{
-			Page		page = BufferGetPage(hscan->rs_cbuf);
-
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-			heap_get_root_tuples(page, root_offsets);
-			LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_UNLOCK);
-
-			memset(in_index, 0, sizeof(in_index));
-
-			root_blkno = hscan->rs_cblock;
+			itemptr_decode(&auxdecoded, DatumGetInt64(ts_val));
+			auxindexcursor = &auxdecoded;
 		}
-
-		/* Convert actual tuple TID to root TID */
-		rootTuple = *heapcursor;
-		root_offnum = ItemPointerGetOffsetNumber(heapcursor);
-
-		if (HeapTupleIsHeapOnly(heapTuple))
+		else
 		{
-			root_offnum = root_offsets[root_offnum - 1];
-			if (!OffsetNumberIsValid(root_offnum))
-				ereport(ERROR,
-						(errcode(ERRCODE_DATA_CORRUPTED),
-						 errmsg_internal("failed to find parent tuple for heap-only tuple at (%u,%u) in table \"%s\"",
-										 ItemPointerGetBlockNumber(heapcursor),
-										 ItemPointerGetOffsetNumber(heapcursor),
-										 RelationGetRelationName(heapRelation))));
-			ItemPointerSetOffsetNumber(&rootTuple, root_offnum);
+			auxindexcursor = NULL;
 		}
 
 		/*
-		 * "merge" by skipping through the index tuples until we find or pass
-		 * the current root tuple.
-		 */
-		while (!tuplesort_empty &&
-			   (!indexcursor ||
-				ItemPointerCompare(indexcursor, &rootTuple) < 0))
+		 * If the auxiliary sort is not yet empty, we now try to synchronize
+		 * the "main" sort cursor (indexcursor) with auxindexcursor. We advance
+		 * the main sort cursor until we've reached or passed the auxiliary TID.
+		 */
+		if (!auxtuplesort_empty)
 		{
-			Datum		ts_val;
-			bool		ts_isnull;
-
-			if (indexcursor)
+			/*
+			 * Move the main sort forward while:
+			 *   (1) It's not exhausted (tuplesort_empty == false), and
+			 *   (2) Either indexcursor is NULL (first iteration) or
+			 *       indexcursor < auxindexcursor in TID order.
+			 */
+			while (!tuplesort_empty && (indexcursor == NULL || /* null on first time here */
+						ItemPointerCompare(indexcursor, auxindexcursor) < 0))
 			{
+				/* Keep track of the previous TID in prev_decoded. */
+				prev_decoded = decoded;
 				/*
-				 * Remember index items seen earlier on the current heap page
+				 * Get the next TID from the main sort. If it's empty,
+				 * we set indexcursor to NULL.
 				 */
-				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
-					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-			}
-
-			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
-												  false, &ts_val, &ts_isnull,
-												  NULL);
-			Assert(tuplesort_empty || !ts_isnull);
-			if (!tuplesort_empty)
-			{
-				itemptr_decode(&decoded, DatumGetInt64(ts_val));
-				indexcursor = &decoded;
-			}
-			else
-			{
-				/* Be tidy */
-				indexcursor = NULL;
+				tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
+													  false, &ts_val, &ts_isnull,
+													  NULL);
+				Assert(tuplesort_empty || !ts_isnull);
+				if (!tuplesort_empty)
+				{
+					itemptr_decode(&decoded, DatumGetInt64(ts_val));
+					indexcursor = &decoded;
+
+					/*
+					 * If the current TID in the main sort is a duplicate of the
+					 * previous one (prev_indexcursor), skip it to avoid
+					 * double-inserting the same TID. Such a situation is possible
+					 * due to concurrent page splits in btree (and probably other
+					 * indexes as well).
+					 */
+					if (ItemPointerCompare(prev_indexcursor, indexcursor) == 0)
+					{
+						elog(DEBUG5, "skipping duplicate tid in target index snapshot: (%u,%u)",
+							 ItemPointerGetBlockNumber(indexcursor),
+							 ItemPointerGetOffsetNumber(indexcursor));
+					}
+				}
+				else
+				{
+					indexcursor = NULL;
+				}
+
+				CHECK_FOR_INTERRUPTS();
 			}
-		}
-
-		/*
-		 * If the tuplesort has overshot *and* we didn't see a match earlier,
-		 * then this tuple is missing from the index, so insert it.
-		 */
-		if ((tuplesort_empty ||
-			 ItemPointerCompare(indexcursor, &rootTuple) > 0) &&
-			!in_index[root_offnum - 1])
-		{
-			MemoryContextReset(econtext->ecxt_per_tuple_memory);
-
-			/* Set up for predicate or expression evaluation */
-			ExecStoreHeapTuple(heapTuple, slot, false);
 
 			/*
-			 * In a partial index, discard tuples that don't satisfy the
-			 * predicate.
+			 * Now, if either:
+			 *  - the main sort is empty, or
+			 *  - indexcursor > auxindexcursor,
+			 *
+			 * then auxindexcursor identifies a TID that doesn't appear in
+			 * the main sort. We likely need to insert it
+			 * into the target index if it's visible in the heap.
 			 */
-			if (predicate != NULL)
+			if (tuplesort_empty || ItemPointerCompare(indexcursor, auxindexcursor) > 0)
 			{
-				if (!ExecQual(predicate, econtext))
-					continue;
-			}
+				bool call_again = false;
+				bool all_dead = false;
+				ItemPointer tid;
 
-			/*
-			 * For the current heap tuple, extract all the attributes we use
-			 * in this index, and note which are null.  This also performs
-			 * evaluation of any expressions needed.
-			 */
-			FormIndexDatum(indexInfo,
-						   slot,
-						   estate,
-						   values,
-						   isnull);
+				/* Copy the auxindexcursor TID into fetched. */
+				fetched = *auxindexcursor;
+				tid = &fetched;
 
-			/*
-			 * You'd think we should go ahead and build the index tuple here,
-			 * but some index AMs want to do further processing on the data
-			 * first. So pass the values[] and isnull[] arrays, instead.
-			 */
-
-			/*
-			 * If the tuple is already committed dead, you might think we
-			 * could suppress uniqueness checking, but this is no longer true
-			 * in the presence of HOT, because the insert is actually a proxy
-			 * for a uniqueness check on the whole HOT-chain.  That is, the
-			 * tuple we have here could be dead because it was already
-			 * HOT-updated, and if so the updating transaction will not have
-			 * thought it should insert index entries.  The index AM will
-			 * check the whole HOT-chain and correctly detect a conflict if
-			 * there is one.
-			 */
+				/* Reset the per-tuple memory context for the next fetch. */
+				MemoryContextReset(econtext->ecxt_per_tuple_memory);
+				state->htups += 1;
 
-			index_insert(indexRelation,
-						 values,
-						 isnull,
-						 &rootTuple,
-						 heapRelation,
-						 indexInfo->ii_Unique ?
-						 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
-						 false,
-						 indexInfo);
-
-			state->tups_inserted += 1;
+				/*
+				 * Fetch the tuple from the heap to see if it's visible
+				 * under our snapshot. If it is, form the index key values
+				 * and insert a new entry into the target index.
+				 */
+				if (heapam_index_fetch_tuple(fetch, tid, snapshot, slot, &call_again, &all_dead))
+				{
+
+					/* Compute the key values and null flags for this tuple. */
+					FormIndexDatum(indexInfo,
+								   slot,
+								   estate,
+								   values,
+								   isnull);
+
+					/*
+					 * Insert the tuple into the target index.
+					 */
+					index_insert(indexRelation,
+								 values,
+								 isnull,
+								 auxindexcursor, /* insert root tuple */
+								 heapRelation,
+								 indexInfo->ii_Unique ?
+								 UNIQUE_CHECK_YES : UNIQUE_CHECK_NO,
+								 false,
+								 indexInfo);
+
+					state->tups_inserted += 1;
+
+					elog(DEBUG5, "inserted tid: (%u,%u), root: (%u, %u)",
+											ItemPointerGetBlockNumber(auxindexcursor),
+											ItemPointerGetOffsetNumber(auxindexcursor),
+											ItemPointerGetBlockNumber(tid),
+											ItemPointerGetOffsetNumber(tid));
+				}
+				else
+				{
+					/*
+					 * The tuple wasn't visible under our snapshot. We
+					 * skip inserting it into the target index because
+					 * from our perspective, it doesn't exist.
+					 */
+					elog(DEBUG5, "skipping insert to target index because tid not visible: (%u,%u)",
+						 ItemPointerGetBlockNumber(auxindexcursor),
+						 ItemPointerGetOffsetNumber(auxindexcursor));
+				}
+			}
 		}
 	}
 
-	table_endscan(scan);
-
 	ExecDropSingleTupleTableSlot(slot);
 
 	FreeExecutorState(estate);
 
+	heapam_index_fetch_end(fetch);
+
+	/*
+	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * snapshot holders, else we will deadlock against other processes also
+	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
+	 * they must wait for.
+	 */
+	PopActiveSnapshot();
+	UnregisterSnapshot(snapshot);
+	InvalidateCatalogSnapshot();
+	Assert(MyProc->xmin == InvalidTransactionId);
+#ifdef USE_INJECTION_POINTS
+	if (MyProc->xid == InvalidTransactionId)
+		INJECTION_POINT("heapam_index_validate_scan_no_xid");
+#endif
 	/* These may have been pointing to the now-gone estate */
 	indexInfo->ii_ExpressionsState = NIL;
 	indexInfo->ii_PredicateState = NULL;
+
+	return limitXmin;
 }
 
 /*
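Side note for readers of the hunk above, not part of the patch: the new loop
is a merge anti-join over two sorted TID streams. The sketch below shows only
that control flow on plain arrays of encoded TIDs. The function name and the
array-based interface are hypothetical; the patch itself pulls values from
two tuplesorts and does a heap fetch plus index_insert() for each TID found
missing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): given two ascending arrays of
 * encoded TIDs, copy into missing[] every element of aux_tids[] that does
 * not appear in main_tids[].  Duplicates in main_tids[] are tolerated.
 * Returns the number of elements written; missing[] must have room for
 * naux entries.
 */
size_t
tids_missing_from_main(const int64_t *main_tids, size_t nmain,
                       const int64_t *aux_tids, size_t naux,
                       int64_t *missing)
{
    size_t      i = 0;          /* cursor into main_tids[] */
    size_t      nmissing = 0;

    for (size_t j = 0; j < naux; j++)
    {
        /* Advance the main cursor until it reaches or passes aux_tids[j]. */
        while (i < nmain && main_tids[i] < aux_tids[j])
            i++;

        /* Main side exhausted or overshot: aux_tids[j] is absent from it. */
        if (i == nmain || main_tids[i] > aux_tids[j])
            missing[nmissing++] = aux_tids[j];
    }

    return nmissing;
}
```

Because both inputs are sorted, each cursor only moves forward, so the pass
is linear in the combined number of TIDs.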
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 7ff7ab6c72a..8b14f66affc 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -719,6 +719,9 @@ UpdateIndexRelation(Oid indexoid,
  * allow_system_table_mods: allow table to be a system catalog
  * is_internal: if true, post creation hook for new index
  * constraintId: if not NULL, receives OID of created constraint
+ * relpersistence: persistence level to use for the index. In most cases
+ *		it should be equal to the persistence level of the table; auxiliary
+ *		indexes are the only exception.
  *
  * Returns the OID of the created index.
  */
@@ -743,7 +746,8 @@ index_create(Relation heapRelation,
 			 bits16 constr_flags,
 			 bool allow_system_table_mods,
 			 bool is_internal,
-			 Oid *constraintId)
+			 Oid *constraintId,
+			 char relpersistence)
 {
 	Oid			heapRelationId = RelationGetRelid(heapRelation);
 	Relation	pg_class;
@@ -754,11 +758,11 @@ index_create(Relation heapRelation,
 	bool		is_exclusion;
 	Oid			namespaceId;
 	int			i;
-	char		relpersistence;
 	bool		isprimary = (flags & INDEX_CREATE_IS_PRIMARY) != 0;
 	bool		invalid = (flags & INDEX_CREATE_INVALID) != 0;
 	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;
 	bool		partitioned = (flags & INDEX_CREATE_PARTITIONED) != 0;
+	bool		auxiliary = (flags & INDEX_CREATE_AUXILIARY) != 0;
 	char		relkind;
 	TransactionId relfrozenxid;
 	MultiXactId relminmxid;
@@ -784,7 +788,6 @@ index_create(Relation heapRelation,
 	namespaceId = RelationGetNamespace(heapRelation);
 	shared_relation = heapRelation->rd_rel->relisshared;
 	mapped_relation = RelationIsMapped(heapRelation);
-	relpersistence = heapRelation->rd_rel->relpersistence;
 
 	/*
 	 * check parameters
@@ -792,6 +795,11 @@ index_create(Relation heapRelation,
 	if (indexInfo->ii_NumIndexAttrs < 1)
 		elog(ERROR, "must index at least one column");
 
+	if (indexInfo->ii_Am == STIR_AM_OID && !auxiliary)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("user-defined indexes with STIR access method are not supported")));
+
 	if (!allow_system_table_mods &&
 		IsSystemRelation(heapRelation) &&
 		IsNormalProcessingMode())
@@ -1462,7 +1470,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 							  0,
 							  true, /* allow table to be a system catalog? */
 							  false,	/* is_internal? */
-							  NULL);
+							  NULL,
+							  heapRelation->rd_rel->relpersistence);
 
 	/* Close the relations used and clean up */
 	index_close(indexRelation, NoLock);
@@ -1472,6 +1481,154 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
 	return newIndexId;
 }
 
+/*
+ * index_concurrently_create_aux
+ *
+ * Concurrently create an auxiliary index based on the definition of the one
+ * provided by the caller.  The index is inserted into catalogs and needs to be
+ * built later on. This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the tablespace to use for this index.
+ */
+Oid
+index_concurrently_create_aux(Relation heapRelation, Oid mainIndexId,
+							   Oid tablespaceOid, const char *newName)
+{
+	Relation	indexRelation;
+	IndexInfo  *oldInfo,
+			*newInfo;
+	Oid			newIndexId = InvalidOid;
+	HeapTuple	indexTuple;
+
+	List	   *indexColNames = NIL;
+	List	   *indexExprs = NIL;
+	List	   *indexPreds = NIL;
+
+	Oid *auxOpclassIds;
+	int16 *auxColoptions;
+
+	indexRelation = index_open(mainIndexId, RowExclusiveLock);
+
+	/* The new index needs some information from the old index */
+	oldInfo = BuildIndexInfo(indexRelation);
+
+	/*
+	 * Build of an auxiliary index with exclusion constraints is not
+	 * supported.
+	 */
+	if (oldInfo->ii_ExclusionOps != NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+						errmsg("auxiliary index creation for exclusion constraints is not supported")));
+
+	/* Get the array of class and column options IDs from index info */
+	indexTuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(mainIndexId));
+	if (!HeapTupleIsValid(indexTuple))
+		elog(ERROR, "cache lookup failed for index %u", mainIndexId);
+
+
+	/*
+	 * Fetch the list of expressions and predicates directly from the
+	 * catalogs.  This cannot rely on the information from IndexInfo of the
+	 * old index as these have been flattened for the planner.
+	 */
+	if (oldInfo->ii_Expressions != NIL)
+	{
+		Datum		exprDatum;
+		char	   *exprString;
+
+		exprDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indexprs);
+		exprString = TextDatumGetCString(exprDatum);
+		indexExprs = (List *) stringToNode(exprString);
+		pfree(exprString);
+	}
+	if (oldInfo->ii_Predicate != NIL)
+	{
+		Datum		predDatum;
+		char	   *predString;
+
+		predDatum = SysCacheGetAttrNotNull(INDEXRELID, indexTuple,
+										   Anum_pg_index_indpred);
+		predString = TextDatumGetCString(predDatum);
+		indexPreds = (List *) stringToNode(predString);
+
+		/* Also convert to implicit-AND format */
+		indexPreds = make_ands_implicit((Expr *) indexPreds);
+		pfree(predString);
+	}
+
+	/*
+	 * Build the index information for the new index.  Note that rebuild of
+	 * indexes with exclusion constraints is not supported, hence there is no
+	 * need to fill all the ii_Exclusion* fields.
+	 */
+	newInfo = makeIndexInfo(oldInfo->ii_NumIndexAttrs,
+							oldInfo->ii_NumIndexKeyAttrs,
+							STIR_AM_OID, /* special AM for aux indexes */
+							indexExprs,
+							indexPreds,
+							false, /* aux indexes are not unique */
+							oldInfo->ii_NullsNotDistinct,
+							false,	/* not ready for inserts */
+							true,
+							false, /* aux indexes are not summarizing */
+							oldInfo->ii_WithoutOverlaps);
+
+	/*
+	 * Extract the list of column names and the column numbers for the new
+	 * index information.  All this information will be used for the index
+	 * creation.
+	 */
+	for (int i = 0; i < oldInfo->ii_NumIndexAttrs; i++)
+	{
+		TupleDesc	indexTupDesc = RelationGetDescr(indexRelation);
+		Form_pg_attribute att = TupleDescAttr(indexTupDesc, i);
+
+		indexColNames = lappend(indexColNames, NameStr(att->attname));
+		newInfo->ii_IndexAttrNumbers[i] = oldInfo->ii_IndexAttrNumbers[i];
+	}
+
+	auxOpclassIds = palloc0(sizeof(Oid) * newInfo->ii_NumIndexAttrs);
+	auxColoptions = palloc0(sizeof(int16) * newInfo->ii_NumIndexAttrs);
+
+	/* Fill with "any ops" */
+	for (int i = 0; i < newInfo->ii_NumIndexAttrs; i++)
+	{
+		auxOpclassIds[i] = ANY_STIR_OPS_OID;
+		auxColoptions[i] = 0;
+	}
+
+	newIndexId = index_create(heapRelation,
+							  newName,
+							  InvalidOid,    /* indexRelationId */
+							  InvalidOid,    /* parentIndexRelid */
+							  InvalidOid,    /* parentConstraintId */
+							  InvalidRelFileNumber, /* relFileNumber */
+							  newInfo,
+							  indexColNames,
+							  STIR_AM_OID,
+							  tablespaceOid,
+							  indexRelation->rd_indcollation,
+							  auxOpclassIds,
+							  NULL,
+							  auxColoptions,
+							  NULL,
+							  (Datum) 0,
+							  INDEX_CREATE_SKIP_BUILD | INDEX_CREATE_CONCURRENT | INDEX_CREATE_AUXILIARY,
+							  0,
+							  true, /* allow table to be a system catalog? */
+							  false,    /* is_internal? */
+							  NULL,
+							  RELPERSISTENCE_UNLOGGED); /* aux indexes unlogged */
+
+	/* Close the relations used and clean up */
+	index_close(indexRelation, NoLock);
+	ReleaseSysCache(indexTuple);
+
+	return newIndexId;
+}
+
 /*
  * index_concurrently_build
  *
@@ -1483,7 +1640,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
  */
 void
 index_concurrently_build(Oid heapRelationId,
-						 Oid indexRelationId)
+						 Oid indexRelationId,
+						 bool auxiliary)
 {
 	Relation	heapRel;
 	Oid			save_userid;
@@ -1524,6 +1682,7 @@ index_concurrently_build(Oid heapRelationId,
 	Assert(!indexInfo->ii_ReadyForInserts);
 	indexInfo->ii_Concurrent = true;
 	indexInfo->ii_BrokenHotChain = false;
+	indexInfo->ii_Auxiliary = auxiliary;
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/* Now build the index */
@@ -3276,12 +3435,20 @@ IndexCheckExclusion(Relation heapRelation,
  *
  * We do a concurrent index build by first inserting the catalog entry for the
  * index via index_create(), marking it not indisready and not indisvalid.
+ * Then we create a special auxiliary index the same way, based on the STIR AM.
  * Then we commit our transaction and start a new one, then we wait for all
  * transactions that could have been modifying the table to terminate.  Now
- * we know that any subsequently-started transactions will see the index and
+ * we know that any subsequently-started transactions will see the indexes and
  * honor its constraints on HOT updates; so while existing HOT-chains might
  * be broken with respect to the index, no currently live tuple will have an
- * incompatible HOT update done to it.  We now build the index normally via
+ * incompatible HOT update done to it.
+ *
+ * Next we build the auxiliary index.  This is a fast operation without any
+ * actual table scan, and it leaves us with an empty STIR index.  We wait again
+ * for all transactions that could have been modifying the table to terminate.
+ * From that moment on, all new tuples are inserted into the auxiliary index.
+ *
+ * We now build the index normally via
  * index_build(), while holding a weak lock that allows concurrent
  * insert/update/delete.  Also, we index only tuples that are valid
  * as of the start of the scan (see table_index_build_scan), whereas a normal
@@ -3292,6 +3459,7 @@ IndexCheckExclusion(Relation heapRelation,
  * different versions of the same row as being valid when we pass over them,
  * if we used HeapTupleSatisfiesVacuum).  This leaves us with an index that
  * does not contain any tuples added to the table while we built the index.
+ * But these tuples are contained in the auxiliary index.
  *
  * Furthermore, we set SO_RESET_SNAPSHOT for the scan, which causes new
  * snapshot to be set as active every so often. The reason  for that is to
@@ -3301,8 +3469,10 @@ IndexCheckExclusion(Relation heapRelation,
  * commit the second transaction and start a third.  Again we wait for all
  * transactions that could have been modifying the table to terminate.  Now
  * we know that any subsequently-started transactions will see the index and
- * insert their new tuples into it.  We then take a new reference snapshot
- * which is passed to validate_index().  Any tuples that are valid according
+ * insert their new tuples into it. At that moment we clear "indisready" for
+ * the auxiliary index, since it is no longer required.
+ *
+ * We then take a new reference snapshot; any tuples that are valid according
  * to this snap, but are not in the index, must be added to the index.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
@@ -3310,12 +3480,14 @@ IndexCheckExclusion(Relation heapRelation,
  * that might care about them before we mark the index valid.)
  *
  * validate_index() works by first gathering all the TIDs currently in the
- * index, using a bulkdelete callback that just stores the TIDs and doesn't
+ * indexes, using a bulkdelete callback that just stores the TIDs and doesn't
  * ever say "delete it".  (This should be faster than a plain indexscan;
  * also, not all index AMs support full-index indexscan.)  Then we sort the
- * TIDs, and finally scan the table doing a "merge join" against the TID list
- * to see which tuples are missing from the index.  Thus we will ensure that
- * all tuples valid according to the reference snapshot are in the index.
+ * TIDs of both the auxiliary and target indexes, and do a "merge join" against
+ * the TID lists to see which tuples from the auxiliary index are missing from
+ * the target index.  Thus we will ensure that all tuples valid according to
+ * the reference snapshot are in the index.  Note that the bulkdelete calls
+ * must be done in a particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
  * tuple that is already dead or is in process of being deleted, and we
@@ -3331,24 +3503,25 @@ IndexCheckExclusion(Relation heapRelation,
  * necessary to be sure there are none left with a transaction snapshot
  * older than the reference (and hence possibly able to see tuples we did
  * not index).  Then we mark the index "indisvalid" and commit.  Subsequent
- * transactions will be able to use it for queries.
- *
- * Doing two full table scans is a brute-force strategy.  We could try to be
- * cleverer, eg storing new tuples in a special area of the table (perhaps
- * making the table append-only by setting use_fsm).  However that would
- * add yet more locking issues.
+ * transactions will be able to use it for queries.  The auxiliary index is
+ * dropped.
  */
-void
-validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+TransactionId
+validate_index(Oid heapId, Oid indexId, Oid auxIndexId)
 {
 	Relation	heapRelation,
-				indexRelation;
+				indexRelation,
+				auxIndexRelation;
 	IndexInfo  *indexInfo;
-	IndexVacuumInfo ivinfo;
-	ValidateIndexState state;
+	TransactionId limitXmin;
+	IndexVacuumInfo ivinfo, auxivinfo;
+	ValidateIndexState state, auxState;
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	/* Use 80% of maintenance_work_mem for sorting the target index TIDs and
+	 * the rest for the auxiliary index TIDs. */
+	int			main_work_mem_part = (maintenance_work_mem * 8) / 10;
 
 	{
 		const int	progress_index[] = {
@@ -3381,13 +3554,18 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	RestrictSearchPath();
 
 	indexRelation = index_open(indexId, RowExclusiveLock);
+	auxIndexRelation = index_open(auxIndexId, RowExclusiveLock);
 
 	/*
 	 * Fetch info needed for index_insert.  (You might think this should be
 	 * passed in from DefineIndex, but its copy is long gone due to having
 	 * been built in a previous transaction.)
+	 *
+	 * We might need a snapshot for index expressions or predicates.
 	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
 	indexInfo = BuildIndexInfo(indexRelation);
+	PopActiveSnapshot();
 
 	/* mark build is concurrent just for consistency */
 	indexInfo->ii_Concurrent = true;
@@ -3405,15 +3583,30 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.strategy = NULL;
 	ivinfo.validate_index = true;
 
+	/*
+	 * Copy all info to auxiliary info, changing only the relation.
+	 */
+	auxivinfo = ivinfo;
+	auxivinfo.index = auxIndexRelation;
+
 	/*
 	 * Encode TIDs as int8 values for the sort, rather than directly sorting
 	 * item pointers.  This can be significantly faster, primarily because TID
 	 * is a pass-by-reference type on all platforms, whereas int8 is
 	 * pass-by-value on most platforms.
 	 */
+	auxState.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
+										   InvalidOid, false,
+										   maintenance_work_mem - main_work_mem_part,
+										   NULL, TUPLESORT_NONE);
+	auxState.htups = auxState.itups = auxState.tups_inserted = 0;
+
+	(void) index_bulk_delete(&auxivinfo, NULL,
+							 validate_index_callback, &auxState);
+
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
-											maintenance_work_mem,
+											main_work_mem_part,
 											NULL, TUPLESORT_NONE);
 	state.htups = state.itups = state.tups_inserted = 0;
 
@@ -3436,27 +3629,33 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 		pgstat_progress_update_multi_param(3, progress_index, progress_vals);
 	}
 	tuplesort_performsort(state.tuplesort);
+	tuplesort_performsort(auxState.tuplesort);
+
+	InvalidateCatalogSnapshot();
+	Assert(!TransactionIdIsValid(MyProc->xmin));
 
 	/*
-	 * Now scan the heap and "merge" it with the index
+	 * Now merge both indexes
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN);
-	table_index_validate_scan(heapRelation,
-							  indexRelation,
-							  indexInfo,
-							  snapshot,
-							  &state);
+								 PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE);
+	limitXmin = table_index_validate_scan(heapRelation,
+										  indexRelation,
+										  indexInfo,
+										  &state,
+										  &auxState);
 
-	/* Done with tuplesort object */
+	/* Done with tuplesort objects */
 	tuplesort_end(state.tuplesort);
+	tuplesort_end(auxState.tuplesort);
 
 	/* Make sure to release resources cached in indexInfo (if needed). */
 	index_insert_cleanup(indexRelation, indexInfo);
 
 	elog(DEBUG2,
-		 "validate_index found %.0f heap tuples, %.0f index tuples; inserted %.0f missing tuples",
-		 state.htups, state.itups, state.tups_inserted);
+		 "validate_index fetched %.0f heap tuples, %.0f index tuples;"
+						" %.0f aux index tuples; inserted %.0f missing tuples",
+		 state.htups, state.itups, auxState.itups, state.tups_inserted);
 
 	/* Roll back any GUC changes executed by index functions */
 	AtEOXact_GUC(false, save_nestlevel);
@@ -3465,8 +3664,12 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	SetUserIdAndSecContext(save_userid, save_sec_context);
 
 	/* Close rels, but keep locks */
+	index_close(auxIndexRelation, NoLock);
 	index_close(indexRelation, NoLock);
 	table_close(heapRelation, NoLock);
+
+	Assert(!TransactionIdIsValid(MyProc->xmin));
+	return limitXmin;
 }
 
 /*
@@ -3525,6 +3728,13 @@ index_set_state_flags(Oid indexId, IndexStateFlagsAction action)
 			Assert(!indexForm->indisvalid);
 			indexForm->indisvalid = true;
 			break;
+		case INDEX_DROP_CLEAR_READY:
+			/* Clear indisready during a CREATE INDEX CONCURRENTLY sequence */
+			Assert(indexForm->indislive);
+			Assert(indexForm->indisready);
+			Assert(!indexForm->indisvalid);
+			indexForm->indisready = false;
+			break;
 		case INDEX_DROP_CLEAR_VALID:
 
 			/*
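Side note, not part of the patch: validate_index() above sorts TIDs as int8
values rather than as item pointers. A minimal standalone sketch of that kind
of encoding follows, assuming the block number goes in the high bits and the
offset in the low 16 bits, so that integer order matches (block, offset)
order. DemoTid and the demo_* functions are hypothetical stand-ins for
ItemPointerData and the itemptr_encode()/itemptr_decode() helpers the patch
uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's ItemPointerData. */
typedef struct DemoTid
{
    uint32_t    block;          /* heap block number */
    uint16_t    offset;         /* line pointer offset within the block */
} DemoTid;

/* Pack a TID into one integer: block in the high bits, offset in the low 16. */
int64_t
demo_tid_encode(DemoTid tid)
{
    return (int64_t) (((uint64_t) tid.block << 16) | tid.offset);
}

/* Inverse of demo_tid_encode(). */
DemoTid
demo_tid_decode(int64_t encoded)
{
    DemoTid     tid;

    tid.block = (uint32_t) ((uint64_t) encoded >> 16);
    tid.offset = (uint16_t) (encoded & 0xffff);
    return tid;
}
```

A 32-bit block number shifted left by 16 bits fits in 48 bits, so the encoded
value is never negative and a plain signed integer sort gives TID order.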
diff --git a/src/backend/catalog/toasting.c b/src/backend/catalog/toasting.c
index ad3082c62ac..fbbcd7d00dd 100644
--- a/src/backend/catalog/toasting.c
+++ b/src/backend/catalog/toasting.c
@@ -325,7 +325,8 @@ create_toast_table(Relation rel, Oid toastOid, Oid toastIndexOid,
 				 BTREE_AM_OID,
 				 rel->rd_rel->reltablespace,
 				 collationIds, opclassIds, NULL, coloptions, NULL, (Datum) 0,
-				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL);
+				 INDEX_CREATE_IS_PRIMARY, 0, true, true, NULL,
+				 toast_rel->rd_rel->relpersistence);
 
 	table_close(toast_rel, NoLock);
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a02729911fe..02b636a0050 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -554,6 +554,7 @@ DefineIndex(Oid tableId,
 {
 	bool		concurrent;
 	char	   *indexRelationName;
+	char	   *auxIndexRelationName = NULL;
 	char	   *accessMethodName;
 	Oid		   *typeIds;
 	Oid		   *collationIds;
@@ -563,6 +564,7 @@ DefineIndex(Oid tableId,
 	Oid			namespaceId;
 	Oid			tablespaceId;
 	Oid			createdConstraintId = InvalidOid;
+	Oid			auxIndexRelationId = InvalidOid;
 	List	   *indexColNames;
 	List	   *allIndexParams;
 	Relation	rel;
@@ -584,10 +586,10 @@ DefineIndex(Oid tableId,
 	int			numberOfKeyAttributes;
 	TransactionId limitXmin;
 	ObjectAddress address;
+	ObjectAddress auxAddress;
 	LockRelId	heaprelid;
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
-	Snapshot	snapshot;
 	Oid			root_save_userid;
 	int			root_save_sec_context;
 	int			root_save_nestlevel;
@@ -834,6 +836,15 @@ DefineIndex(Oid tableId,
 											stmt->excludeOpNames,
 											stmt->primary,
 											stmt->isconstraint);
+	/*
+	 * Select a name for the auxiliary index
+	 */
+	if (concurrent)
+		auxIndexRelationName = ChooseRelationName(indexRelationName,
+												  NULL,
+												  "ccaux",
+												  namespaceId,
+												  false);
 
 	/*
 	 * look up the access method, verify it can handle the requested features
@@ -1227,7 +1238,8 @@ DefineIndex(Oid tableId,
 					 coloptions, NULL, reloptions,
 					 flags, constr_flags,
 					 allowSystemTableMods, !check_rights,
-					 &createdConstraintId);
+					 &createdConstraintId,
+					 rel->rd_rel->relpersistence);
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
@@ -1569,6 +1581,16 @@ DefineIndex(Oid tableId,
 		return address;
 	}
 
+	/*
+	 * In case of a concurrent build, create the auxiliary index record.
+	 */
+	if (concurrent)
+	{
+		auxIndexRelationId = index_concurrently_create_aux(rel, indexRelationId,
+											tablespaceId, auxIndexRelationName);
+		ObjectAddressSet(auxAddress, RelationRelationId, auxIndexRelationId);
+	}
+
 	AtEOXact_GUC(false, root_save_nestlevel);
 	SetUserIdAndSecContext(root_save_userid, root_save_sec_context);
 
@@ -1597,11 +1619,11 @@ DefineIndex(Oid tableId,
 	/*
 	 * For a concurrent build, it's important to make the catalog entries
 	 * visible to other transactions before we start to build the index. That
-	 * will prevent them from making incompatible HOT updates.  The new index
-	 * will be marked not indisready and not indisvalid, so that no one else
-	 * tries to either insert into it or use it for queries.
+	 * will prevent them from making incompatible HOT updates.  The new indexes
+	 * (main and auxiliary) will be marked not indisready and not indisvalid, so
+	 * that no one else tries to insert into them or use them for queries.
 	 *
-	 * We must commit our current transaction so that the index becomes
+	 * We must commit our current transaction so that the indexes become
 	 * visible; then start another.  Note that all the data structures we just
 	 * built are lost in the commit.  The only data we keep past here are the
 	 * relation IDs.
@@ -1611,7 +1633,7 @@ DefineIndex(Oid tableId,
 	 * cannot block, even if someone else is waiting for access, because we
 	 * already have the same lock within our transaction.
 	 *
-	 * Note: we don't currently bother with a session lock on the index,
+	 * Note: we don't currently bother with a session lock on the indexes,
 	 * because there are no operations that could change its state while we
 	 * hold lock on the parent table.  This might need to change later.
 	 */
@@ -1632,14 +1654,16 @@ DefineIndex(Oid tableId,
 	{
 		const int	progress_cols[] = {
 			PROGRESS_CREATEIDX_INDEX_OID,
+			PROGRESS_CREATEIDX_AUX_INDEX_OID,
 			PROGRESS_CREATEIDX_PHASE
 		};
 		const int64 progress_vals[] = {
 			indexRelationId,
+			auxIndexRelationId,
 			PROGRESS_CREATEIDX_PHASE_WAIT_1
 		};
 
-		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
+		pgstat_progress_update_multi_param(3, progress_cols, progress_vals);
 	}
 
 	/*
@@ -1650,7 +1674,7 @@ DefineIndex(Oid tableId,
 	 * with the old list of indexes.  Use ShareLock to consider running
 	 * transactions that hold locks that permit writing to the table.  Note we
 	 * do not need to worry about xacts that open the table for writing after
-	 * this point; they will see the new index when they open it.
+	 * this point; they will see the new indexes when they open it.
 	 *
 	 * Note: the reason we use actual lock acquisition here, rather than just
 	 * checking the ProcArray and sleeping, is that deadlock is possible if
@@ -1662,15 +1686,39 @@ DefineIndex(Oid tableId,
 
 	/*
 	 * At this moment we are sure that there are no transactions with the
-	 * table open for write that don't have this new index in their list of
+	 * table open for write that don't have these new indexes in their list of
 	 * indexes.  We have waited out all the existing transactions and any new
-	 * transaction will have the new index in its list, but the index is still
-	 * marked as "not-ready-for-inserts".  The index is consulted while
+	 * transaction will have both new indexes in its list, but the indexes are
+	 * still marked as "not-ready-for-inserts". The indexes are consulted while
 	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
 	 * chains can be created where the new tuple and the old tuple in the
 	 * chain have different index keys.
 	 *
-	 * We build the index using all tuples that are visible using multiple
+	 * Now build the auxiliary index. It is created empty, without any actual
+	 * heap scan, but is marked as "ready-for-inserts". Its purpose is to
+	 * accumulate new tuples while the main index is being built.
+	 */
+	index_concurrently_build(tableId, auxIndexRelationId, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Now we need to ensure that there are no transactions that still see the
+	 * auxiliary index as "not-ready-for-inserts".
+	 */
+	WaitForLockers(heaplocktag, ShareLock, true);
+
+	/*
+	 * At this moment we are sure that all new tuples in the table are inserted
+	 * into the auxiliary index. Now it is time to build the target index itself.
+	 *
+	 * We build that index using all tuples that are visible using multiple
 	 * refreshing snapshots. We can be sure that any HOT updates to
 	 * these tuples will be compatible with the index, since any updates made
 	 * by transactions that didn't know about the index are now committed or
@@ -1679,7 +1727,7 @@ DefineIndex(Oid tableId,
 	 */
 
 	/* Perform concurrent build of index */
-	index_concurrently_build(tableId, indexRelationId);
+	index_concurrently_build(tableId, indexRelationId, false);
 
 	/*
 	 * Commit this transaction to make the indisready update visible.
@@ -1698,43 +1746,28 @@ DefineIndex(Oid tableId,
 	 * the index marked as read-only for updates.
 	 */
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockers(heaplocktag, ShareLock, true);
 
 	/*
-	 * Now take the "reference snapshot" that will be used by validate_index()
-	 * to filter candidate tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
-	 *
-	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
-	 *
-	 * We also set ActiveSnapshot to this snap, since functions in indexes may
-	 * need a snapshot.
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
-	PushActiveSnapshot(snapshot);
-
+	PushActiveSnapshot(GetTransactionSnapshot());
 	/*
-	 * Scan the index and the heap, insert any missing index entries.
+	 * The target index is now marked as "ready" for all transactions, so the
+	 * auxiliary index is no longer needed. Start the removal process by
+	 * clearing its "ready" flag.
 	 */
-	validate_index(tableId, indexRelationId, snapshot);
-
-	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
-	 * snapshot holders, else we will deadlock against other processes also
-	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
-	 * they must wait for.  But first, save the snapshot's xmin to use as
-	 * limitXmin for GetCurrentVirtualXIDs().
-	 */
-	limitXmin = snapshot->xmin;
-
+	index_set_state_flags(auxIndexRelationId, INDEX_DROP_CLEAR_READY);
 	PopActiveSnapshot();
-	UnregisterSnapshot(snapshot);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Merge the contents of the auxiliary and target indexes, inserting any missing index entries.
+	 */
+	limitXmin = validate_index(tableId, indexRelationId, auxIndexRelationId);
 
 	/*
 	 * The snapshot subsystem could still contain registered snapshots that
@@ -1747,6 +1780,49 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
+	/* Tell concurrent index builds to ignore us, if index qualifies */
+	if (safe_index)
+		set_indexsafe_procflags();
+
+	/*
+	 * Updating pg_index might involve TOAST table access, so ensure we
+	 * have a valid snapshot.
+	 */
+	PushActiveSnapshot(GetTransactionSnapshot());
+	/* Now it is time to mark the auxiliary index as dead */
+	index_concurrently_set_dead(tableId, auxIndexRelationId);
+	PopActiveSnapshot();
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 */
+
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+	/* Now wait for all transactions to ignore the auxiliary index, as it is dead */
+	WaitForLockers(heaplocktag, AccessExclusiveLock, true);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
+	/*
+	 * Drop auxiliary index.
+	 *
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
+	 *
+	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
+	 * right lock level.
+	 */
+	performDeletion(&auxAddress, DROP_RESTRICT,
+							 PERFORM_DELETION_CONCURRENT_LOCK | PERFORM_DELETION_INTERNAL);
+
+	CommitTransactionCommand();
+	StartTransactionCommand();
+
 	/* Tell concurrent index builds to ignore us, if index qualifies */
 	if (safe_index)
 		set_indexsafe_procflags();
@@ -1757,12 +1833,12 @@ DefineIndex(Oid tableId,
 	/*
 	 * The index is now valid in the sense that it contains all currently
 	 * interesting tuples.  But since it might not contain tuples deleted just
-	 * before the reference snap was taken, we have to wait out any
-	 * transactions that might have older snapshots.
+	 * before the last snapshot used during validation was taken, we have to
+	 * wait out any transactions that might have older snapshots.
 	 */
 	INJECTION_POINT("define_index_before_set_valid");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForOlderSnapshots(limitXmin, true);
 
 	/*
@@ -3542,6 +3618,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	typedef struct ReindexIndexInfo
 	{
 		Oid			indexId;
+		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
 		bool		safe;		/* for set_indexsafe_procflags */
@@ -3563,9 +3640,10 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PROGRESS_CREATEIDX_COMMAND,
 		PROGRESS_CREATEIDX_PHASE,
 		PROGRESS_CREATEIDX_INDEX_OID,
+		PROGRESS_CREATEIDX_AUX_INDEX_OID,
 		PROGRESS_CREATEIDX_ACCESS_METHOD_OID
 	};
-	int64		progress_vals[4];
+	int64		progress_vals[5];
 
 	/*
 	 * Create a memory context that will survive forced transaction commits we
@@ -3865,15 +3943,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	foreach(lc, indexIds)
 	{
 		char	   *concurrentName;
+		char	   *auxConcurrentName;
 		ReindexIndexInfo *idx = lfirst(lc);
 		ReindexIndexInfo *newidx;
 		Oid			newIndexId;
+		Oid			auxIndexId;
 		Relation	indexRel;
 		Relation	heapRel;
 		Oid			save_userid;
 		int			save_sec_context;
 		int			save_nestlevel;
 		Relation	newIndexRel;
+		Relation	auxIndexRel;
 		LockRelId  *lockrelid;
 		Oid			tablespaceid;
 
@@ -3915,8 +3996,9 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = 0;	/* initializing */
 		progress_vals[2] = idx->indexId;
-		progress_vals[3] = idx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = InvalidOid;
+		progress_vals[4] = idx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Choose a temporary relation name for the new index */
 		concurrentName = ChooseRelationName(get_rel_name(idx->indexId),
@@ -3924,6 +4006,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 											"ccnew",
 											get_rel_namespace(indexRel->rd_index->indrelid),
 											false);
+		auxConcurrentName = ChooseRelationName(get_rel_name(idx->indexId),
+											NULL,
+											"ccaux",
+											get_rel_namespace(indexRel->rd_index->indrelid),
+											false);
 
 		/* Choose the new tablespace, indexes of toast tables are not moved */
 		if (OidIsValid(params->tablespaceOid) &&
@@ -3937,12 +4024,17 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 													idx->indexId,
 													tablespaceid,
 													concurrentName);
+		auxIndexId = index_concurrently_create_aux(heapRel,
+												   idx->indexId,
+												   tablespaceid,
+												   auxConcurrentName);
 
 		/*
 		 * Now open the relation of the new index, a session-level lock is
 		 * also needed on it.
 		 */
 		newIndexRel = index_open(newIndexId, ShareUpdateExclusiveLock);
+		auxIndexRel = index_open(auxIndexId, ShareUpdateExclusiveLock);
 
 		/*
 		 * Save the list of OIDs and locks in private context
@@ -3951,6 +4043,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
+		newidx->auxIndexId = auxIndexId;
 		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
@@ -3969,10 +4062,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		lockrelid = palloc_object(LockRelId);
 		*lockrelid = newIndexRel->rd_lockInfo.lockRelId;
 		relationLocks = lappend(relationLocks, lockrelid);
+		lockrelid = palloc_object(LockRelId);
+		*lockrelid = auxIndexRel->rd_lockInfo.lockRelId;
+		relationLocks = lappend(relationLocks, lockrelid);
 
 		MemoryContextSwitchTo(oldcontext);
 
 		index_close(indexRel, NoLock);
+		index_close(auxIndexRel, NoLock);
 		index_close(newIndexRel, NoLock);
 
 		/* Roll back any GUC changes executed by index functions */
@@ -4053,13 +4150,55 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * doing that, wait until no running transactions could have the table of
 	 * the index open with the old list of indexes.  See "phase 2" in
 	 * DefineIndex() for more details.
+	 */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+							 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * Now build all auxiliary indexes and mark them as "ready-for-inserts".
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		StartTransactionCommand();
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/* Build the auxiliary index: fast, as it is created empty with no heap scan */
+		index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
+
+		CommitTransactionCommand();
+	}
+
+	StartTransactionCommand();
+
+	/*
+	 * Because we don't take a snapshot in this transaction, there's no need
+	 * to set the PROC_IN_SAFE_IC flag here.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_1);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
+	/*
+	 * Wait until all auxiliary indexes are taken into account by all
+	 * transactions.
+	 */
 	WaitForLockersMultiple(lockTags, ShareLock, true);
 	CommitTransactionCommand();
 
+	/* Now it is time to build the target indexes. */
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
@@ -4086,11 +4225,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_BUILD;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
 		/* Perform concurrent build of new index */
-		index_concurrently_build(newidx->tableId, newidx->indexId);
+		index_concurrently_build(newidx->tableId, newidx->indexId, false);
 
 		CommitTransactionCommand();
 	}
@@ -4102,24 +4242,52 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	 * need to set the PROC_IN_SAFE_IC flag here.
 	 */
 
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
+								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+	WaitForLockersMultiple(lockTags, ShareLock, true);
+	CommitTransactionCommand();
+
+	/*
+	 * At this moment all target indexes are marked as "ready-for-inserts", so
+	 * we are free to start the process of dropping the auxiliary indexes.
+	 */
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+		StartTransactionCommand();
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/* Tell concurrent indexing to ignore us, if index qualifies */
+		if (newidx->safe)
+			set_indexsafe_procflags();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+		index_set_state_flags(newidx->auxIndexId, INDEX_DROP_CLEAR_READY);
+		PopActiveSnapshot();
+
+		CommitTransactionCommand();
+	}
+
 	/*
 	 * Phase 3 of REINDEX CONCURRENTLY
 	 *
-	 * During this phase the old indexes catch up with any new tuples that
+	 * During this phase the new indexes catch up with any new tuples that
 	 * were created during the previous phase.  See "phase 3" in DefineIndex()
 	 * for more details.
 	 */
-
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
-	WaitForLockersMultiple(lockTags, ShareLock, true);
-	CommitTransactionCommand();
-
 	foreach(lc, newIndexIds)
 	{
 		ReindexIndexInfo *newidx = lfirst(lc);
 		TransactionId limitXmin;
-		Snapshot	snapshot;
 
 		StartTransactionCommand();
 
@@ -4134,13 +4302,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		if (newidx->safe)
 			set_indexsafe_procflags();
 
-		/*
-		 * Take the "reference snapshot" that will be used by validate_index()
-		 * to filter candidate tuples.
-		 */
-		snapshot = RegisterSnapshot(GetTransactionSnapshot());
-		PushActiveSnapshot(snapshot);
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4149,19 +4310,12 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		progress_vals[0] = PROGRESS_CREATEIDX_COMMAND_REINDEX_CONCURRENTLY;
 		progress_vals[1] = PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN;
 		progress_vals[2] = newidx->indexId;
-		progress_vals[3] = newidx->amId;
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+		progress_vals[3] = newidx->auxIndexId;
+		progress_vals[4] = newidx->amId;
+		pgstat_progress_update_multi_param(5, progress_index, progress_vals);
 
-		validate_index(newidx->tableId, newidx->indexId, snapshot);
-
-		/*
-		 * We can now do away with our active snapshot, we still need to save
-		 * the xmin limit to wait for older snapshots.
-		 */
-		limitXmin = snapshot->xmin;
-
-		PopActiveSnapshot();
-		UnregisterSnapshot(snapshot);
+		limitXmin = validate_index(newidx->tableId, newidx->indexId, newidx->auxIndexId);
+		Assert(!TransactionIdIsValid(MyProc->xmin));
 
 		/*
 		 * To ensure no deadlocks, we must commit and start yet another
@@ -4181,7 +4335,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-									 PROGRESS_CREATEIDX_PHASE_WAIT_3);
+									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
 		WaitForOlderSnapshots(limitXmin, true);
 
 		CommitTransactionCommand();
@@ -4271,14 +4425,14 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
-	 * Mark the old indexes as dead.  First we must wait until no running
-	 * transaction could be using the index for a query.  See also
+	 * Mark the old and auxiliary indexes as dead. First we must wait until no
+	 * running transaction could be using the index for a query.  See also
 	 * index_drop() for more details.
 	 */
 
 	INJECTION_POINT("reindex_relation_concurrently_before_set_dead");
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_4);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	foreach(lc, indexIds)
@@ -4303,6 +4457,28 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		PopActiveSnapshot();
 	}
 
+	foreach(lc, newIndexIds)
+	{
+		ReindexIndexInfo *newidx = lfirst(lc);
+
+		/*
+		 * Check for user-requested abort.  This is inside a transaction so as
+		 * xact.c does not issue a useless WARNING, and ensures that
+		 * session-level locks are cleaned up on abort.
+		 */
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Updating pg_index might involve TOAST table access, so ensure we
+		 * have a valid snapshot.
+		 */
+		PushActiveSnapshot(GetTransactionSnapshot());
+
+		index_concurrently_set_dead(newidx->tableId, newidx->auxIndexId);
+
+		PopActiveSnapshot();
+	}
+
 	/* Commit this transaction to make the updates visible. */
 	CommitTransactionCommand();
 	StartTransactionCommand();
@@ -4316,11 +4492,11 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
-	 * Drop the old indexes.
+	 * Drop the old and auxiliary indexes.
 	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
-								 PROGRESS_CREATEIDX_PHASE_WAIT_5);
+								 PROGRESS_CREATEIDX_PHASE_WAIT_6);
 	WaitForLockersMultiple(lockTags, AccessExclusiveLock, true);
 
 	PushActiveSnapshot(GetTransactionSnapshot());
@@ -4340,6 +4516,18 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 			add_exact_object_address(&object, objects);
 		}
 
+		foreach(lc, newIndexIds)
+		{
+			ReindexIndexInfo *idx = lfirst(lc);
+			ObjectAddress object;
+
+			object.classId = RelationRelationId;
+			object.objectId = idx->auxIndexId;
+			object.objectSubId = 0;
+
+			add_exact_object_address(&object, objects);
+		}
+
 		/*
 		 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 		 * right lock level.
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 0ecc3147bbd..fa1bdca7e2b 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -714,11 +714,11 @@ typedef struct TableAmRoutine
 										   TableScanDesc scan);
 
 	/* see table_index_validate_scan for reference about parameters */
-	void		(*index_validate_scan) (Relation table_rel,
-										Relation index_rel,
-										struct IndexInfo *index_info,
-										Snapshot snapshot,
-										struct ValidateIndexState *state);
+	TransactionId 		(*index_validate_scan) (Relation table_rel,
+												Relation index_rel,
+												struct IndexInfo *index_info,
+												struct ValidateIndexState *state,
+												struct ValidateIndexState *aux_state);
 
 
 	/* ------------------------------------------------------------------------
@@ -1862,22 +1862,22 @@ table_index_build_range_scan(Relation table_rel,
 }
 
 /*
- * table_index_validate_scan - second table scan for concurrent index build
+ * table_index_validate_scan - validation scan for concurrent index build
  *
  * See validate_index() for an explanation.
  */
-static inline void
+static inline TransactionId
 table_index_validate_scan(Relation table_rel,
 						  Relation index_rel,
 						  struct IndexInfo *index_info,
-						  Snapshot snapshot,
-						  struct ValidateIndexState *state)
+						  struct ValidateIndexState *state,
+						  struct ValidateIndexState *auxstate)
 {
-	table_rel->rd_tableam->index_validate_scan(table_rel,
-											   index_rel,
-											   index_info,
-											   snapshot,
-											   state);
+	return table_rel->rd_tableam->index_validate_scan(table_rel,
+													  index_rel,
+													  index_info,
+													  state,
+													  auxstate);
 }
 
 
diff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 2dea96f47c3..82d0d6b46d3 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -25,6 +25,7 @@ typedef enum
 {
 	INDEX_CREATE_SET_READY,
 	INDEX_CREATE_SET_VALID,
+	INDEX_DROP_CLEAR_READY,
 	INDEX_DROP_CLEAR_VALID,
 	INDEX_DROP_SET_DEAD,
 } IndexStateFlagsAction;
@@ -65,6 +66,7 @@ extern void index_check_primary_key(Relation heapRel,
 #define	INDEX_CREATE_IF_NOT_EXISTS			(1 << 4)
 #define	INDEX_CREATE_PARTITIONED			(1 << 5)
 #define INDEX_CREATE_INVALID				(1 << 6)
+#define INDEX_CREATE_AUXILIARY				(1 << 7)
 
 extern Oid	index_create(Relation heapRelation,
 						 const char *indexRelationName,
@@ -86,7 +88,8 @@ extern Oid	index_create(Relation heapRelation,
 						 bits16 constr_flags,
 						 bool allow_system_table_mods,
 						 bool is_internal,
-						 Oid *constraintId);
+						 Oid *constraintId,
+						 char relpersistence);
 
 #define	INDEX_CONSTR_CREATE_MARK_AS_PRIMARY	(1 << 0)
 #define	INDEX_CONSTR_CREATE_DEFERRABLE		(1 << 1)
@@ -100,8 +103,14 @@ extern Oid	index_concurrently_create_copy(Relation heapRelation,
 										   Oid tablespaceOid,
 										   const char *newName);
 
+extern Oid	index_concurrently_create_aux(Relation heapRelation,
+										  Oid mainIndexId,
+										  Oid tablespaceOid,
+										  const char *newName);
+
 extern void index_concurrently_build(Oid heapRelationId,
-									 Oid indexRelationId);
+									 Oid indexRelationId,
+									 bool auxiliary);
 
 extern void index_concurrently_swap(Oid newIndexId,
 									Oid oldIndexId,
@@ -145,7 +154,7 @@ extern void index_build(Relation heapRelation,
 						bool isreindex,
 						bool parallel);
 
-extern void validate_index(Oid heapId, Oid indexId, Snapshot snapshot);
+extern TransactionId validate_index(Oid heapId, Oid indexId, Oid auxIndexId);
 
 extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index 5616d645230..89f8d02fdc3 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -88,6 +88,7 @@
 #define PROGRESS_CREATEIDX_TUPLES_DONE			12
 #define PROGRESS_CREATEIDX_PARTITIONS_TOTAL		13
 #define PROGRESS_CREATEIDX_PARTITIONS_DONE		14
+#define PROGRESS_CREATEIDX_AUX_INDEX_OID		15
 /* 15 and 16 reserved for "block number" metrics */
 
 /* Phases of CREATE INDEX (as advertised via PROGRESS_CREATEIDX_PHASE) */
@@ -96,10 +97,11 @@
 #define PROGRESS_CREATEIDX_PHASE_WAIT_2			3
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXSCAN	4
 #define PROGRESS_CREATEIDX_PHASE_VALIDATE_SORT		5
-#define PROGRESS_CREATEIDX_PHASE_VALIDATE_TABLESCAN	6
+#define PROGRESS_CREATEIDX_PHASE_VALIDATE_IDXMERGE	6
 #define PROGRESS_CREATEIDX_PHASE_WAIT_3			7
 #define PROGRESS_CREATEIDX_PHASE_WAIT_4			8
 #define PROGRESS_CREATEIDX_PHASE_WAIT_5			9
+#define PROGRESS_CREATEIDX_PHASE_WAIT_6			10
 
 /*
  * Subphases of CREATE INDEX, for index_build.
diff --git a/src/test/modules/injection_points/expected/cic_reset_snapshots.out b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
index 9f03fa3033c..780313f477b 100644
--- a/src/test/modules/injection_points/expected/cic_reset_snapshots.out
+++ b/src/test/modules/injection_points/expected/cic_reset_snapshots.out
@@ -23,6 +23,12 @@ SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
  
 (1 row)
 
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
+ injection_points_attach 
+-------------------------
+ 
+(1 row)
+
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
 INSERT INTO cic_reset_snap.tbl SELECT i, i * I FROM generate_series(1, 200) s(i);
@@ -43,30 +49,38 @@ ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=0);
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
@@ -76,9 +90,11 @@ DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
 NOTICE:  notice triggered for injection point heap_reset_scan_snapshot_effective
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 -- The same in parallel mode
 ALTER TABLE cic_reset_snap.tbl SET (parallel_workers=2);
@@ -91,23 +107,31 @@ SELECT injection_points_detach('heap_reset_scan_snapshot_effective');
 
 CREATE UNIQUE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(MOD(i, 2), j) WHERE MOD(i, 2) = 0;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable(i);
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_beginscan_strat_reset_snapshots
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i, j) WHERE cic_reset_snap.predicate_stable_no_param();
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
@@ -116,13 +140,17 @@ NOTICE:  notice triggered for injection point table_parallelscan_initialize
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl(i DESC NULLS LAST);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 CREATE INDEX CONCURRENTLY idx ON cic_reset_snap.tbl USING BRIN(i);
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 REINDEX INDEX CONCURRENTLY cic_reset_snap.idx;
 NOTICE:  notice triggered for injection point table_parallelscan_initialize
+NOTICE:  notice triggered for injection point heapam_index_validate_scan_no_xid
 DROP INDEX CONCURRENTLY cic_reset_snap.idx;
 DROP SCHEMA cic_reset_snap CASCADE;
 NOTICE:  drop cascades to 3 other objects
diff --git a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
index 2941aa7ae38..249d1061ada 100644
--- a/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
+++ b/src/test/modules/injection_points/sql/cic_reset_snapshots.sql
@@ -4,6 +4,7 @@ SELECT injection_points_set_local();
 SELECT injection_points_attach('heap_reset_scan_snapshot_effective', 'notice');
 SELECT injection_points_attach('table_beginscan_strat_reset_snapshots', 'notice');
 SELECT injection_points_attach('table_parallelscan_initialize', 'notice');
+SELECT injection_points_attach('heapam_index_validate_scan_no_xid', 'notice');
 
 CREATE SCHEMA cic_reset_snap;
 CREATE TABLE cic_reset_snap.tbl(i int primary key, j int);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 1904eb65bb9..7e008b1cbd9 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1423,6 +1423,7 @@ DETAIL:  Key (f1)=(b) already exists.
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
 ERROR:  could not create unique index "concur_index3"
 DETAIL:  Key (f2)=(b) is duplicated.
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -3015,6 +3016,7 @@ INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
 ERROR:  could not create unique index "concur_reindex_ind5"
 DETAIL:  Key (c1)=(1) is duplicated.
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
@@ -3027,8 +3029,10 @@ DETAIL:  Key (c1)=(1) is duplicated.
  c1     | integer |           |          | 
 Indexes:
     "concur_reindex_ind5" UNIQUE, btree (c1) INVALID
+    "concur_reindex_ind5_ccaux" stir (c1) INVALID
     "concur_reindex_ind5_ccnew" UNIQUE, btree (c1) INVALID
 
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index bcf1db11d73..3fecaa38850 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1585,10 +1585,11 @@ select indexrelid::regclass, indisvalid,
 --------------------------------+------------+-----------------------+-------------------------------
  parted_isvalid_idx             | f          | parted_isvalid_tab    | 
  parted_isvalid_idx_11          | f          | parted_isvalid_tab_11 | parted_isvalid_tab_1_expr_idx
+ parted_isvalid_idx_11_ccaux    | f          | parted_isvalid_tab_11 | 
  parted_isvalid_tab_12_expr_idx | t          | parted_isvalid_tab_12 | parted_isvalid_tab_1_expr_idx
  parted_isvalid_tab_1_expr_idx  | f          | parted_isvalid_tab_1  | parted_isvalid_idx
  parted_isvalid_tab_2_expr_idx  | t          | parted_isvalid_tab_2  | parted_isvalid_idx
-(5 rows)
+(6 rows)
 
 drop table parted_isvalid_tab;
 -- Check state of replica indexes when attaching a partition.
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index c085e05f052..c44e460b0d3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -499,6 +499,7 @@ CREATE UNIQUE INDEX CONCURRENTLY IF NOT EXISTS concur_index2 ON concur_heap(f1);
 INSERT INTO concur_heap VALUES ('b','x');
 -- check if constraint is enforced properly at build time
 CREATE UNIQUE INDEX CONCURRENTLY concur_index3 ON concur_heap(f2);
+DROP INDEX concur_index3_ccaux;
 -- test that expression indexes and partial indexes work concurrently
 CREATE INDEX CONCURRENTLY concur_index4 on concur_heap(f2) WHERE f1='a';
 CREATE INDEX CONCURRENTLY concur_index5 on concur_heap(f2) WHERE f1='x';
@@ -1239,10 +1240,12 @@ CREATE TABLE concur_reindex_tab4 (c1 int);
 INSERT INTO concur_reindex_tab4 VALUES (1), (1), (2);
 -- This trick creates an invalid index.
 CREATE UNIQUE INDEX CONCURRENTLY concur_reindex_ind5 ON concur_reindex_tab4 (c1);
+DROP INDEX concur_reindex_ind5_ccaux;
 -- Reindexing concurrently this index fails with the same failure.
 -- The extra index created is itself invalid, and can be dropped.
 REINDEX INDEX CONCURRENTLY concur_reindex_ind5;
 \d concur_reindex_tab4
+DROP INDEX concur_reindex_ind5_ccaux;
 DROP INDEX concur_reindex_ind5_ccnew;
 -- This makes the previous failure go away, so the index can become valid.
 DELETE FROM concur_reindex_tab4 WHERE c1 = 1;
-- 
2.43.0



  [application/octet-stream] v9-0008-Concurrently-built-index-validation-uses-fresh-sn.patch (10.6K, 10-v9-0008-Concurrently-built-index-validation-uses-fresh-sn.patch)
From 103989dcbe91603da753b7e9647ad12df888cfb4 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 19:17:25 +0100
Subject: [PATCH v9 8/9] Concurrently built index validation uses fresh
 snapshots

This commit modifies the validation process for concurrently built indexes to use fresh snapshots instead of a single reference snapshot.

The previous approach held a single reference snapshot for the whole validation scan. If validation took a long time, that snapshot kept this backend's xmin pinned, so the xmin horizon could not advance until the scan finished.

To address this, the validation process now periodically replaces the snapshot with a newer one. This lets the backend's xmin advance while the scan is still running. Missing tuples are inserted only if they are visible to the current fresh snapshot, and the newest snapshot xmin seen so far is kept in limitXmin so that older snapshots can be waited out before the index is marked valid.

The interval for replacing the snapshot is controlled by the `VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL` constant, which is currently set to 1000 milliseconds.
---
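Note for readers: the refresh logic described above can be sketched in
isolation. This is a minimal standalone model, not the patch itself. The
names xid_follows, xid_newer, ValidateState and maybe_refresh are
hypothetical; only the 1000 ms interval and the "keep the newer xmin" rule
come from the patch. Timestamps and the fresh snapshot's xmin are passed in
explicitly so the behaviour is easy to check.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t xid_t;

#define INVALID_XID       ((xid_t) 0)
#define FIRST_NORMAL_XID  ((xid_t) 3)
/* Mirrors VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL from the patch (ms). */
#define RESET_INTERVAL_MS 1000.0

/* Wraparound-aware "a is newer than b", modulo 2^32 for normal XIDs. */
static bool
xid_follows(xid_t a, xid_t b)
{
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a > b;			/* special XIDs compare as plain integers */
	return (int32_t) (a - b) > 0;
}

/* Return the newer of two XIDs; an invalid XID loses to a valid one. */
static xid_t
xid_newer(xid_t a, xid_t b)
{
	if (a == INVALID_XID)
		return b;
	if (b == INVALID_XID)
		return a;
	return xid_follows(a, b) ? a : b;
}

/* Hypothetical state carried across iterations of the validation loop. */
typedef struct
{
	double		snapshot_time_ms;	/* when the current snapshot was taken */
	xid_t		limit_xmin;			/* newest snapshot xmin seen so far */
} ValidateState;

/*
 * Called once per loop iteration.  If the current snapshot is older than
 * the interval, pretend to take a fresh one whose xmin is fresh_xmin and
 * remember the newer of the two xmin values.  Returns true on refresh.
 */
static bool
maybe_refresh(ValidateState *st, double now_ms, xid_t fresh_xmin)
{
	if (now_ms - st->snapshot_time_ms < RESET_INTERVAL_MS)
		return false;

	st->limit_xmin = xid_newer(st->limit_xmin, fresh_xmin);
	st->snapshot_time_ms = now_ms;
	return true;
}
```

In this model limit_xmin never moves backwards, even if a later snapshot
were to report an older xmin, which is the property the final wait for older
snapshots relies on.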
 src/backend/access/heap/README.HOT       | 15 +++++---
 src/backend/access/heap/heapam_handler.c | 45 ++++++++++++++++++------
 src/backend/access/nbtree/nbtsort.c      |  2 +-
 src/backend/catalog/index.c              |  7 ++--
 src/backend/commands/indexcmds.c         |  2 +-
 src/include/access/transam.h             | 15 ++++++++
 6 files changed, 66 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/README.HOT b/src/backend/access/heap/README.HOT
index 829dad1194e..d41609c97cd 100644
--- a/src/backend/access/heap/README.HOT
+++ b/src/backend/access/heap/README.HOT
@@ -375,6 +375,11 @@ constraint on which updates can be HOT.  Other transactions must include
 such an index when determining HOT-safety of updates, even though they
 must ignore it for both insertion and searching purposes.
 
+Also, a special auxiliary index is created the same way.  It is marked as
+"ready for inserts" without any actual table scan.  Its purpose is to collect
+new tuples inserted into the table while our target index is still "not ready
+for inserts".
+
 We must do this to avoid making incorrect index entries.  For example,
 suppose we are building an index on column X and we make an index entry for
 a non-HOT tuple with X=1.  Then some other backend, unaware that X is an
@@ -394,14 +399,14 @@ As above, we point the index entry at the root of the HOT-update chain but we
 use the key value from the live tuple.
 
 We mark the index open for inserts (but still not ready for reads) then
-we again wait for transactions which have the table open.  Then we take
-a second reference snapshot and validate the index.  This searches for
-tuples missing from the index, and inserts any missing ones.  Again,
-the index entries have to have TIDs equal to HOT-chain root TIDs, but
+we again wait for transactions which have the table open.  Then we validate
+the index.  This searches the auxiliary index for tuples missing from the
+target index, and inserts any missing ones visible to a fresh snapshot.
+Again, the index entries have to have TIDs equal to HOT-chain root TIDs, but
 the value to be inserted is the one from the live tuple.
 
 Then we wait until every transaction that could have a snapshot older than
-the second reference snapshot is finished.  This ensures that nobody is
+the latest used snapshot is finished.  This ensures that nobody is
 alive any longer who could need to see any tuples that might be missing
 from the index, as well as ensuring that no one can see any inconsistent
 rows in a broken HOT chain (the first condition is stronger than the
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ecec3c1c080..1a041c5a77b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1806,27 +1806,35 @@ heapam_index_validate_scan(Relation heapRelation,
 					fetched;
 	bool			tuplesort_empty = false,
 					auxtuplesort_empty = false;
+	instr_time		snapshotTime,
+					currentTime;
 
 	Assert(!HaveRegisteredOrActiveSnapshot());
 	Assert(!TransactionIdIsValid(MyProc->xmin));
 
+#define VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL	1000
 	/*
-	 * Now take the "reference snapshot" that will be used by to filter candidate
-	 * tuples.  Beware!  There might still be snapshots in
-	 * use that treat some transaction as in-progress that our reference
-	 * snapshot treats as committed.  If such a recently-committed transaction
-	 * deleted tuples in the table, we will not include them in the index; yet
-	 * those transactions which see the deleting one as still-in-progress will
-	 * expect such tuples to be there once we mark the index as valid.
+	 * Now take the first snapshot that will be used to filter candidate
+	 * tuples. We are going to replace it with a newer snapshot every so often
+	 * to propagate the xmin horizon.
+	 *
+	 * Beware!  There might still be snapshots in use that treat some transaction
+	 * as in-progress that our temporary snapshot treats as committed.
+	 *
+	 * If such a recently-committed transaction deleted tuples in the table,
+	 * we will not include them in the index; yet those transactions which
+	 * see the deleting one as still-in-progress will expect such tuples to
+	 * be there once we mark the index as valid.
 	 *
 	 * We solve this by waiting for all endangered transactions to exit before
-	 * we mark the index as valid.
+	 * we mark the index as valid; for that reason limitXmin is maintained.
 	 *
 	 * We also set ActiveSnapshot to this snap, since functions in indexes may
 	 * need a snapshot.
 	 */
-	snapshot = RegisterSnapshot(GetTransactionSnapshot());
+	snapshot = RegisterSnapshot(GetLatestSnapshot());
 	PushActiveSnapshot(snapshot);
+	INSTR_TIME_SET_CURRENT(snapshotTime);
 	limitXmin = snapshot->xmin;
 
 	/*
@@ -1868,6 +1876,23 @@ heapam_index_validate_scan(Relation heapRelation,
 		bool		ts_isnull;
 		CHECK_FOR_INTERRUPTS();
 
+		INSTR_TIME_SET_CURRENT(currentTime);
+		INSTR_TIME_SUBTRACT(currentTime, snapshotTime);
+		if (INSTR_TIME_GET_MILLISEC(currentTime) >= VALIDATE_INDEX_SNAPSHOT_RESET_INTERVAL)
+		{
+			PopActiveSnapshot();
+			UnregisterSnapshot(snapshot);
+			/* to make sure we propagate xmin */
+			InvalidateCatalogSnapshot();
+			Assert(!TransactionIdIsValid(MyProc->xmin));
+
+			snapshot = RegisterSnapshot(GetLatestSnapshot());
+			PushActiveSnapshot(snapshot);
+			/* xmin should not go backwards, but just in case */
+			limitXmin = TransactionIdNewer(limitXmin, snapshot->xmin);
+			INSTR_TIME_SET_CURRENT(snapshotTime);
+		}
+
 		/*
 		* Attempt to fetch the next TID from the auxiliary sort. If it's
 		* empty, we set auxindexcursor to NULL.
@@ -2020,7 +2045,7 @@ heapam_index_validate_scan(Relation heapRelation,
 	heapam_index_fetch_end(fetch);
 
 	/*
-	 * Drop the reference snapshot.  We must do this before waiting out other
+	 * Drop the latest snapshot.  We must do this before waiting out other
 	 * snapshot holders, else we will deadlock against other processes also
 	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
 	 * they must wait for.
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 38355601421..60551f82bfa 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -442,7 +442,7 @@ _bt_spools_heapscan(Relation heap, Relation index, BTBuildState *buildstate,
 	 * dead tuples) won't get very full, so we give it only work_mem.
 	 *
 	 * In case of concurrent build dead tuples are not need to be put into index
-	 * since we wait for all snapshots older than reference snapshot during the
+	 * since we wait for all snapshots older than latest snapshot during the
 	 * validation phase.
 	 */
 	if (indexInfo->ii_Unique && !indexInfo->ii_Concurrent)
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 8b14f66affc..b4df2b1eee6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3472,8 +3472,9 @@ IndexCheckExclusion(Relation heapRelation,
  * insert their new tuples into it. At that moment we clear "indisready" for
  * auxiliary index, since it is no more required/
  *
- * We then take a new reference snapshot, any tuples that are valid according
- * to this snap, but are not in the index, must be added to the index.
+ * We then take a new snapshot, any tuples that are valid according
+ * to this snap, but are not in the index, must be added to the index. In
+ * order to propagate xmin we reset that snapshot every so often.
  * (Any tuples committed live after the snap will be inserted into the
  * index by their originating transaction.  Any tuples committed dead before
  * the snap need not be indexed, because we will wait out all transactions
@@ -3486,7 +3487,7 @@ IndexCheckExclusion(Relation heapRelation,
  * TIDs of both auxiliary and target indexes, and doing a "merge join" against
  * the TID lists to see which tuples from auxiliary index are missing from the
  * target index.  Thus we will ensure that all tuples valid according to the
- * reference snapshot are in the index. Notice we need to do bulkdelete in the
+ * latest snapshot are in the index. Notice we need to do bulkdelete in the
  * particular order: auxiliary first, target last.
  *
  * Building a unique index this way is tricky: we might try to insert a
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 02b636a0050..71baeced508 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -4328,7 +4328,7 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		/*
 		 * The index is now valid in the sense that it contains all currently
 		 * interesting tuples.  But since it might not contain tuples deleted
-		 * just before the reference snap was taken, we have to wait out any
+		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
 		 *
 		 * Because we don't take a snapshot or Xid in this transaction,
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 28a2d287fd5..90d358804e4 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -355,6 +355,21 @@ NormalTransactionIdOlder(TransactionId a, TransactionId b)
 	return b;
 }
 
+/* return the newer of the two IDs */
+static inline TransactionId
+TransactionIdNewer(TransactionId a, TransactionId b)
+{
+	if (!TransactionIdIsValid(a))
+		return b;
+
+	if (!TransactionIdIsValid(b))
+		return a;
+
+	if (TransactionIdFollows(a, b))
+		return a;
+	return b;
+}
+
 /* return the newer of the two IDs */
 static inline FullTransactionId
 FullTransactionIdNewer(FullTransactionId a, FullTransactionId b)
-- 
2.43.0



  [application/octet-stream] v9-0009-concurrent-index-build-Remove-PROC_IN_SAFE_IC-opt.patch (20.5K, 11-v9-0009-concurrent-index-build-Remove-PROC_IN_SAFE_IC-opt.patch)
From f4c00ab0c12b2af59e801d66d689d2378730a707 Mon Sep 17 00:00:00 2001
From: nkey <[email protected]>
Date: Tue, 24 Dec 2024 19:36:25 +0100
Subject: [PATCH v9 9/9] concurrent index build: Remove PROC_IN_SAFE_IC
 optimization

Remove the optimization that allowed concurrent index builds to ignore other
concurrent builds of "safe" indexes (those without expressions or predicates).
This optimization is no longer needed with the new snapshot handling approach
that uses periodically refreshed snapshots instead of a single reference
snapshot.

The change greatly simplifies the concurrent index build code by:
- Removing the PROC_IN_SAFE_IC process status flag
- Removing all set_indexsafe_procflags() calls and related logic
- Removing special case handling in GetCurrentVirtualXIDs()
- Removing related test cases and injection points

This is part of improving concurrent index builds to better handle xmin
propagation during long-running operations.
---
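Note for readers: the effect of dropping PROC_IN_SAFE_IC from the exclusion
mask can be illustrated with a simplified model of the wait filter. This is
a sketch under stated assumptions, not the real GetCurrentVirtualXIDs():
BackendModel and count_waited_backends are hypothetical names, and xmin
values are compared as plain integers, ignoring wraparound. The flag values
are the ones shown in the proc.h hunk of this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as they appear in src/include/storage/proc.h. */
#define PROC_IS_AUTOVACUUM	0x01
#define PROC_IN_VACUUM		0x02
/* Bit used by PROC_IN_SAFE_IC before this patch removes it. */
#define OLD_PROC_IN_SAFE_IC	0x04

/* Hypothetical, simplified stand-in for one backend's advertised state. */
typedef struct
{
	uint32_t	xmin;			/* 0 means no snapshot is advertised */
	uint8_t		status_flags;
} BackendModel;

/*
 * Count the backends a concurrent index build would have to wait for:
 * those not excluded by flag, advertising an xmin, whose xmin is not
 * newer than limit_xmin.  (Assumption: no XID wraparound in this model.)
 */
static size_t
count_waited_backends(const BackendModel *backends, size_t n,
					  uint32_t limit_xmin, uint8_t exclude_flags)
{
	size_t		waited = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (backends[i].status_flags & exclude_flags)
			continue;			/* e.g. lazy vacuum: safe to ignore */
		if (backends[i].xmin == 0)
			continue;			/* holds no snapshot */
		if (backends[i].xmin <= limit_xmin)
			waited++;
	}
	return waited;
}
```

With the old mask, a backend carrying the 0x04 bit is skipped. With the mask
used after this patch (PROC_IS_AUTOVACUUM | PROC_IN_VACUUM) the same backend
is counted if it advertises an old enough xmin, which is acceptable only
because validation no longer holds one snapshot for its whole duration.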
 src/backend/access/brin/brin.c                |   6 +-
 src/backend/access/nbtree/nbtsort.c           |   6 +-
 src/backend/commands/indexcmds.c              | 142 +-----------------
 src/include/storage/proc.h                    |   8 +-
 src/test/modules/injection_points/Makefile    |   2 +-
 .../expected/reindex_conc.out                 |  51 -------
 src/test/modules/injection_points/meson.build |   1 -
 .../injection_points/sql/reindex_conc.sql     |  28 ----
 8 files changed, 10 insertions(+), 234 deletions(-)
 delete mode 100644 src/test/modules/injection_points/expected/reindex_conc.out
 delete mode 100644 src/test/modules/injection_points/sql/reindex_conc.sql

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index f076cedcc2c..048c7d7995b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2886,11 +2886,9 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	int			sortmem;
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 60551f82bfa..c6f7e527b65 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1907,11 +1907,9 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 #endif							/* BTREE_BUILD_STATS */
 
 	/*
-	 * The only possible status flag that can be set to the parallel worker is
-	 * PROC_IN_SAFE_IC.
+	 * No status flag can be set for the parallel worker.
 	 */
-	Assert((MyProc->statusFlags == 0) ||
-		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+	Assert(MyProc->statusFlags == 0);
 
 	/* Set debug_query_string for individual workers first */
 	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 71baeced508..ae058dc701b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -116,7 +116,6 @@ static bool ReindexRelationConcurrently(const ReindexStmt *stmt,
 										Oid relationOid,
 										const ReindexParams *params);
 static void update_relispartition(Oid relationId, bool newval);
-static inline void set_indexsafe_procflags(void);
 
 /*
  * callback argument type for RangeVarCallbackForReindexIndex()
@@ -416,10 +415,7 @@ CompareOpclassOptions(const Datum *opts1, const Datum *opts2, int natts)
  * lazy VACUUMs, because they won't be fazed by missing index entries
  * either.  (Manual ANALYZEs, however, can't be excluded because they
  * might be within transactions that are going to do arbitrary operations
- * later.)  Processes running CREATE INDEX CONCURRENTLY or REINDEX CONCURRENTLY
- * on indexes that are neither expressional nor partial are also safe to
- * ignore, since we know that those processes won't examine any data
- * outside the table they're indexing.
+ * later.)
  *
  * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
  * check for that.
@@ -440,8 +436,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 	VirtualTransactionId *old_snapshots;
 
 	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
-										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-										  | PROC_IN_SAFE_IC,
+										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 										  &n_old_snapshots);
 	if (progress)
 		pgstat_progress_update_param(PROGRESS_WAITFOR_TOTAL, n_old_snapshots);
@@ -461,8 +456,7 @@ WaitForOlderSnapshots(TransactionId limitXmin, bool progress)
 
 			newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
 													true, false,
-													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM
-													| PROC_IN_SAFE_IC,
+													PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
 													&n_newer_snapshots);
 			for (j = i; j < n_old_snapshots; j++)
 			{
@@ -576,7 +570,6 @@ DefineIndex(Oid tableId,
 	amoptions_function amoptions;
 	bool		exclusion;
 	bool		partitioned;
-	bool		safe_index;
 	Datum		reloptions;
 	int16	   *coloptions;
 	IndexInfo  *indexInfo;
@@ -1153,10 +1146,6 @@ DefineIndex(Oid tableId,
 		}
 	}
 
-	/* Is index safe for others to ignore?  See set_indexsafe_procflags() */
-	safe_index = indexInfo->ii_Expressions == NIL &&
-		indexInfo->ii_Predicate == NIL;
-
 	/*
 	 * Report index creation if appropriate (delay this till after most of the
 	 * error checks)
@@ -1643,10 +1632,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * The index is now visible, so we can report the OID.  While on it,
 	 * include the report for the beginning of phase 2.
@@ -1703,9 +1688,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -1735,10 +1717,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Phase 3 of concurrent index build
 	 *
@@ -1780,10 +1758,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/*
 	 * Updating pg_index might involve TOAST table access, so ensure we
 	 * have a valid snapshot.
@@ -1795,10 +1769,6 @@ DefineIndex(Oid tableId,
 
 	CommitTransactionCommand();
 	StartTransactionCommand();
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
 
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 							 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -1811,9 +1781,6 @@ DefineIndex(Oid tableId,
 	/*
 	 * Drop auxiliary index.
 	 *
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 *
 	 * Use PERFORM_DELETION_CONCURRENT_LOCK so that index_drop() uses the
 	 * right lock level.
 	 */
@@ -1823,10 +1790,6 @@ DefineIndex(Oid tableId,
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/* Tell concurrent index builds to ignore us, if index qualifies */
-	if (safe_index)
-		set_indexsafe_procflags();
-
 	/* We should now definitely not be advertising any xmin. */
 	Assert(MyProc->xmin == InvalidTransactionId);
 
@@ -3621,7 +3584,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		Oid			auxIndexId;
 		Oid			tableId;
 		Oid			amId;
-		bool		safe;		/* for set_indexsafe_procflags */
 	} ReindexIndexInfo;
 	List	   *heapRelationIds = NIL;
 	List	   *indexIds = NIL;
@@ -3973,17 +3935,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		save_nestlevel = NewGUCNestLevel();
 		RestrictSearchPath();
 
-		/* determine safety of this index for set_indexsafe_procflags */
-		idx->safe = (RelationGetIndexExpressions(indexRel) == NIL &&
-					 RelationGetIndexPredicate(indexRel) == NIL);
-
-#ifdef USE_INJECTION_POINTS
-		if (idx->safe)
-			INJECTION_POINT("reindex-conc-index-safe");
-		else
-			INJECTION_POINT("reindex-conc-index-not-safe");
-#endif
-
 		idx->tableId = RelationGetRelid(heapRel);
 		idx->amId = indexRel->rd_rel->relam;
 
@@ -4044,7 +3995,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		newidx = palloc_object(ReindexIndexInfo);
 		newidx->indexId = newIndexId;
 		newidx->auxIndexId = auxIndexId;
-		newidx->safe = idx->safe;
 		newidx->tableId = idx->tableId;
 		newidx->amId = idx->amId;
 
@@ -4137,11 +4087,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	/*
 	 * Phase 2 of REINDEX CONCURRENTLY
 	 *
@@ -4172,10 +4117,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/* Build auxiliary index, it is fast - without any actual heap scan, just an empty index. */
 		index_concurrently_build(newidx->tableId, newidx->auxIndexId, true);
 
@@ -4184,11 +4125,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot in this transaction, there's no need
-	 * to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
 	/*
@@ -4213,10 +4149,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4237,11 +4169,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 
 	StartTransactionCommand();
 
-	/*
-	 * Because we don't take a snapshot or Xid in this transaction, there's no
-	 * need to set the PROC_IN_SAFE_IC flag here.
-	 */
-
 	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
 	WaitForLockersMultiple(lockTags, ShareLock, true);
@@ -4262,10 +4189,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Updating pg_index might involve TOAST table access, so ensure we
 		 * have a valid snapshot.
@@ -4298,10 +4221,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 */
 		CHECK_FOR_INTERRUPTS();
 
-		/* Tell concurrent indexing to ignore us, if index qualifies */
-		if (newidx->safe)
-			set_indexsafe_procflags();
-
 		/*
 		 * Update progress for the index to build, with the correct parent
 		 * table involved.
@@ -4330,9 +4249,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 		 * interesting tuples.  But since it might not contain tuples deleted
 		 * just before the latest snap was taken, we have to wait out any
 		 * transactions that might have older snapshots.
-		 *
-		 * Because we don't take a snapshot or Xid in this transaction,
-		 * there's no need to set the PROC_IN_SAFE_IC flag here.
 		 */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
 									 PROGRESS_CREATEIDX_PHASE_WAIT_4);
@@ -4354,13 +4270,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	INJECTION_POINT("reindex_relation_concurrently_before_swap");
 	StartTransactionCommand();
 
-	/*
-	 * Because this transaction only does catalog manipulations and doesn't do
-	 * any index operations, we can set the PROC_IN_SAFE_IC flag here
-	 * unconditionally.
-	 */
-	set_indexsafe_procflags();
-
 	forboth(lc, indexIds, lc2, newIndexIds)
 	{
 		ReindexIndexInfo *oldidx = lfirst(lc);
@@ -4416,12 +4325,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 5 of REINDEX CONCURRENTLY
 	 *
@@ -4483,12 +4386,6 @@ ReindexRelationConcurrently(const ReindexStmt *stmt, Oid relationOid, const Rein
 	CommitTransactionCommand();
 	StartTransactionCommand();
 
-	/*
-	 * While we could set PROC_IN_SAFE_IC if all indexes qualified, there's no
-	 * real need for that, because we only acquire an Xid after the wait is
-	 * done, and that lasts for a very short period.
-	 */
-
 	/*
 	 * Phase 6 of REINDEX CONCURRENTLY
 	 *
@@ -4748,36 +4645,3 @@ update_relispartition(Oid relationId, bool newval)
 	table_close(classRel, RowExclusiveLock);
 }
 
-/*
- * Set the PROC_IN_SAFE_IC flag in MyProc->statusFlags.
- *
- * When doing concurrent index builds, we can set this flag
- * to tell other processes concurrently running CREATE
- * INDEX CONCURRENTLY or REINDEX CONCURRENTLY to ignore us when
- * doing their waits for concurrent snapshots.  On one hand it
- * avoids pointlessly waiting for a process that's not interesting
- * anyway; but more importantly it avoids deadlocks in some cases.
- *
- * This can be done safely only for indexes that don't execute any
- * expressions that could access other tables, so index must not be
- * expressional nor partial.  Caller is responsible for only calling
- * this routine when that assumption holds true.
- *
- * (The flag is reset automatically at transaction end, so it must be
- * set for each transaction.)
- */
-static inline void
-set_indexsafe_procflags(void)
-{
-	/*
-	 * This should only be called before installing xid or xmin in MyProc;
-	 * otherwise, concurrent processes could see an Xmin that moves backwards.
-	 */
-	Assert(MyProc->xid == InvalidTransactionId &&
-		   MyProc->xmin == InvalidTransactionId);
-
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	MyProc->statusFlags |= PROC_IN_SAFE_IC;
-	ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
-	LWLockRelease(ProcArrayLock);
-}
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5a3dd5d2d40..a8ee412397a 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -56,10 +56,6 @@ struct XidCache
  */
 #define		PROC_IS_AUTOVACUUM	0x01	/* is it an autovac worker? */
 #define		PROC_IN_VACUUM		0x02	/* currently running lazy vacuum */
-#define		PROC_IN_SAFE_IC		0x04	/* currently running CREATE INDEX
-										 * CONCURRENTLY or REINDEX
-										 * CONCURRENTLY on non-expressional,
-										 * non-partial index */
 #define		PROC_VACUUM_FOR_WRAPAROUND	0x08	/* set by autovac only */
 #define		PROC_IN_LOGICAL_DECODING	0x10	/* currently doing logical
 												 * decoding outside xact */
@@ -69,13 +65,13 @@ struct XidCache
 
 /* flags reset at EOXact */
 #define		PROC_VACUUM_STATE_MASK \
-	(PROC_IN_VACUUM | PROC_IN_SAFE_IC | PROC_VACUUM_FOR_WRAPAROUND)
+	(PROC_IN_VACUUM | PROC_VACUUM_FOR_WRAPAROUND)
 
 /*
  * Xmin-related flags. Make sure any flags that affect how the process' Xmin
  * value is interpreted by VACUUM are included here.
  */
-#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM | PROC_IN_SAFE_IC)
+#define		PROC_XMIN_FLAGS (PROC_IN_VACUUM)
 
 /*
  * We allow a limited number of "weak" relation locks (AccessShareLock,
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index 73893d351bb..bc0a06a1274 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -10,7 +10,7 @@ EXTENSION = injection_points
 DATA = injection_points--1.0.sql
 PGFILEDESC = "injection_points - facility for injection points"
 
-REGRESS = injection_points reindex_conc cic_reset_snapshots
+REGRESS = injection_points cic_reset_snapshots
 REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic inplace \
diff --git a/src/test/modules/injection_points/expected/reindex_conc.out b/src/test/modules/injection_points/expected/reindex_conc.out
deleted file mode 100644
index db8de4bbe85..00000000000
--- a/src/test/modules/injection_points/expected/reindex_conc.out
+++ /dev/null
@@ -1,51 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
- injection_points_set_local 
-----------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
- injection_points_attach 
--------------------------
- 
-(1 row)
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-NOTICE:  notice triggered for injection point reindex-conc-index-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-NOTICE:  notice triggered for injection point reindex-conc-index-not-safe
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-SELECT injection_points_detach('reindex-conc-index-not-safe');
- injection_points_detach 
--------------------------
- 
-(1 row)
-
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-DROP EXTENSION injection_points;
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index f288633da4f..73cb5e92fdc 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -34,7 +34,6 @@ tests += {
   'regress': {
     'sql': [
       'injection_points',
-      'reindex_conc',
       'cic_reset_snapshots',
     ],
     'regress_args': ['--dlpath', meson.build_root() / 'src/test/regress'],
diff --git a/src/test/modules/injection_points/sql/reindex_conc.sql b/src/test/modules/injection_points/sql/reindex_conc.sql
deleted file mode 100644
index 6cf211e6d5d..00000000000
--- a/src/test/modules/injection_points/sql/reindex_conc.sql
+++ /dev/null
@@ -1,28 +0,0 @@
--- Tests for REINDEX CONCURRENTLY
-CREATE EXTENSION injection_points;
-
--- Check safety of indexes with predicates and expressions.
-SELECT injection_points_set_local();
-SELECT injection_points_attach('reindex-conc-index-safe', 'notice');
-SELECT injection_points_attach('reindex-conc-index-not-safe', 'notice');
-
-CREATE SCHEMA reindex_inj;
-CREATE TABLE reindex_inj.tbl(i int primary key, updated_at timestamp);
-
-CREATE UNIQUE INDEX ind_simple ON reindex_inj.tbl(i);
-CREATE UNIQUE INDEX ind_expr ON reindex_inj.tbl(ABS(i));
-CREATE UNIQUE INDEX ind_pred ON reindex_inj.tbl(i) WHERE mod(i, 2) = 0;
-CREATE UNIQUE INDEX ind_expr_pred ON reindex_inj.tbl(abs(i)) WHERE mod(i, 2) = 0;
-
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_simple;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_pred;
-REINDEX INDEX CONCURRENTLY reindex_inj.ind_expr_pred;
-
--- Cleanup
-SELECT injection_points_detach('reindex-conc-index-safe');
-SELECT injection_points_detach('reindex-conc-index-not-safe');
-DROP TABLE reindex_inj.tbl;
-DROP SCHEMA reindex_inj;
-
-DROP EXTENSION injection_points;
-- 
2.43.0


